Wednesday Nov 07, 2007

Processors and Performance : Chips, MIPS, and Sizing blips..

The following post is a close approximation of an article that I published in this month's Sun "Technocrat" (November 2007) issue. Hopefully you'll enjoy this discussion of the past and present relationship of CPU's and architecture to system performance.

In today's fast paced world of ever increasing demands for system throughput, the foundation of discussion and expectations typically all hinge upon the same topic.. CPU performance.   This article will be an examination of CPU's and system architecture, as they relate to performance and capacity planning as a whole.   From our last discussion "The Many Flavors of System Latency ..." , we will extend the context to focus on past and present competing aspects of system/CPU architecture,  including a brief history of how we got to the current competitive landscape we find ourselves in today.  (The photo to the left is of a Sun T1 "coolthreads" 8 core, 32 thread cpu)

CISC vs. RISC

Going back to the early days of microprocessor design (and also likely a familiar topic from your Computer Science Bachelor's curriculum), you will recall much conversation, speculation, and competition surrounding two competing approaches to CPU architecture :  CISC vs. RISC.

In a nutshell, CISC (Complex Instruction Set Computers) designs emphasize the use of "complex" instructions within the HW to minimize the amount of Assembly code (SW) required.  Other benefits are that compilers don't need to be as complex, as well as CISC CPU's requiring less RAM to store instructions.  However, this approach sometimes requires more than one clock cycle of a processor to complete processing a complex instruction.

RISC (Reduced Instruction Set Computers), just as the name implies, offer a reduced number of simple instructions that can each complete execution within a single clock cycle, but which might require multiple instructions for a complex operation such as "multiply".  The RISC approach has the nice side-effect of also requiring less chip "footprint" (in # of transistors, etc..), reserving more die space for registers, while also typically offering streamlined execution within the same window of execution time as CISC counterparts.  Modern compilers (such as Sun's Studio 12 line) offer significant performance benefits that should not be overlooked, and are especially critical when developing and compiling binaries on RISC based architectures (Sun has seen performance gains of 200+ % when using the latest releases of Sun Studio Compilers vs. generic "gcc" compiled code).

Today, the market is segmented into more or less the same 2 camps, but they are divided along the lines of "x86" compatible CPU's (modern CISC designs from Intel and AMD) and the modern RISC competition from Sun (SPARC) and IBM (Power), while HP and DEC (now gone altogether) have stepped back from RISC manufacturing.

Is there more to Moore's Law ?

For the past 30+ years, we have seen processor performance double according to Moore's Law (approximately every 2 years), along with a movement from Uniprocessor based systems to those that scale vertically as the multi-processor systems that we've become familiar with in the Unix world, better known as SMP (Symmetric MultiProcessor) systems.   It's funny that Gordon Moore made his original claim in a 1965 article in Electronics magazine, regarding the trend of Integrated Circuit component counts doubling every year, only loosely predicting that this trend might continue until 1975, though he was uncertain that any further projections could be made (he later revised his prediction in 1975 to "doubling every 2 years", which has held relatively steady ever since).   Since past trends of IC / CPU transistor counts have closely correlated to CPU performance, the association of Moore's Law to performance was made.
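
To put the two doubling rates in perspective, here is a rough sketch of the compound growth each one implies.  The function name is my own invention, and the time spans are simply the ones mentioned above (10 years at yearly doubling, 30 years at doubling every 2 years).

```python
def growth_factor(years, doubling_period_years):
    """Multiplicative growth after `years` of steady doubling."""
    return 2 ** (years / doubling_period_years)

# Moore's original 1965 observation: doubling every year, projected over 10 years.
print(growth_factor(10, 1))

# The revised rate: doubling every 2 years, over 30 years.
print(growth_factor(30, 2))
```

Thirty years of doubling every 2 years works out to 15 doublings, which is why even a small slip in the doubling period matters so much over time.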

In order to accommodate this rapid rate of growth in processor performance, the number of transistors contained within a CPU has been just one of the key characteristics that have climbed (in addition to clock speeds, etc..).   Amazing as it may sound, 45nm (nanometer) manufacturing is expected from Intel and others in 2008 (when just 10 years ago we had 500nm manufacturing).  Even at the levels that we were at in 2004 with 90nm manufacturing, the width of a transistor (50nm across) was 1/2 the diameter of an influenza virus !   Given that we are approaching atomic dimensions of gate thickness (at < 1nm), the pace of transistor density and clock frequency increases (using current designs) appears to be approaching the physical limitations of manufacturing.  In addition to these concerns, power and related heat issues have become very pronounced in today's "green" computing campaigns.  Luckily, Sun has dealt a hand worthy of industry recognition that includes CPU innovations to keep us progressing forward, addressing not just simple "transistor counts", but more aptly the efficiency of moving an SMP-like architecture "onto" a piece of silicon, hence.. a system on a chip (which is essentially what we have with the T2).

How many ways can you weave those THREADS ?

SMP systems required an Operating System kernel that had a means of fairly sharing the CPU resources among the processes awaiting execution within the system (via priority-based scheduling classes).  This capability within Unix for "time slicing" between "runnable" processes and available kernel and physical CPU resources hinges on the kernel dispatcher associating processes with "light weight processes" that in turn get bound to kernel "Threads", in order to be run within the "context" of physical processor registers and an execution pipeline (aka, HW Threads).  For the most part, this is how today's Solaris OS functions, along with the appropriate dose of preemption and locking mechanisms.

Over the past decade, and along with the increased demands of internet traffic (and associated application workloads), applications have gradually become better able to scale vertically within systems, primarily through the use of SW multi-threading.   Within a multi-threaded application, many threads of execution can run simultaneously across available CPU's within a system, allowing for an application to scale as close to "linearly" as possible (doubling the application throughput as the # of available CPU's doubled).

    THREAD (of Execution) :
  • noun          One of many software "threads" of execution that can be processed simultaneously on a computer system.
    Linear Scalability :
  • noun          The ability to increase system performance (throughput) at the same rate that resources (cpu's,..)are added.
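
As a quick illustration of the "linear scalability" definition, the sketch below compares the speedup actually achieved against the ideal (linear) speedup.  The function name and the throughput figures are hypothetical, chosen only to show the arithmetic.

```python
def scaling_efficiency(base_throughput, base_cpus, new_throughput, new_cpus):
    """Ratio of achieved speedup to ideal (linear) speedup; 1.0 means linear."""
    achieved = new_throughput / base_throughput
    ideal = new_cpus / base_cpus
    return achieved / ideal

# Hypothetical measurements: 1,000 tx/sec on 4 CPUs, 1,800 tx/sec on 8 CPUs.
print(scaling_efficiency(1000, 4, 1800, 8))
```

A result near 1.0 means the application is scaling close to linearly; anything well below that suggests contention somewhere along the way.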

CPU's and Memory :  UMA and NUMA

Modern computing systems also offer a "shared memory" model that has very specific performance and latency related characteristics, depending upon the system design (physical system interconnect type [bus/crossbar], memory management controllers, proximity to physical cache/RAM, etc..), the Operating System Kernel memory management (memory mgmt libraries, JVM Garbage collection, etc..), as well as the Application SW execution characteristics and memory requirements (cache hit ratio's [Instruction vs. Data Cache], TLB miss characteristics, RAM requirements, ..).

Among the modern types of vertically scalable parallel computer architectures available for multiprocessor systems, one of the most common high performance designs has become NUMA/ccNUMA (Non Uniform Memory Access / cache coherent NUMA).  Sun's E25K systems fall within this category, as do most large systems that offer vertical scalability of many independent CPU / Memory boards, acting together and communicating across a shared system interconnect (backplane/centerplane/crossbar).

Previously, local proximity of memory and the latency associated with accessing that memory was uniform and predictable.  With the advent of system growth beyond single system (cpu/memory) boards, UMA (Uniform Memory Access) could no longer be guaranteed.  One common performance issue that must be addressed within large NUMA environments is the aggregate impact of both physical memory proximity, alongside the gap between processor speed and memory latency (see below).   Solaris addressed this issue (with a Solaris 9 update) by introducing MPO (Memory Placement Optimization) that associates memory physically closer to a cpu to minimize the additional cross interconnect memory latency (this couples Solaris and the underlying HW along with Cache Coherency within ccNUMA architectures).  Note, other optimizations have been used to address large Sun Enterprise system centerplane latencies, including kernel cage splitting, removal (if DR isn't required), and the introduction of S10 enhancements.   (* "busstat" can be used to diagnose these issues *)
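
To see why memory placement matters on a NUMA system, consider a simple weighted average of local and remote access latencies.  This is only a sketch: the function name is mine, and the 200 ns / 500 ns figures are hypothetical placeholders, not measurements from any Sun system.

```python
def average_memory_latency_ns(local_ns, remote_ns, local_fraction):
    """Weighted average latency when `local_fraction` of accesses are local."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

# Hypothetical latencies: 200 ns local, 500 ns across the interconnect.
for fraction in (0.25, 0.50, 0.90):
    print(fraction, average_memory_latency_ns(200, 500, fraction))
```

The more accesses that can be kept local (which is the goal of an optimization such as MPO), the closer the average gets to the local latency.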

Lucky for us, we are quickly approaching system architectures that once again allow for UMA designs, offering memory access predictability (with the T2 and related offerings moving forward, where density of cpu cores/threads brings greater multithreaded capacity within a small footprint).  However, the current state of the industry isn't quite as lucky if you look across the bow of the competition, and within our client production environments.

The growing CPU - Memory gap ...


As you can see from the diagram to the left, over the past several years as microprocessor design has moved rapidly, doubling the performance and clockspeed of CPU's roughly every 2 years, the same increases have NOT been matched within the Memory arena.  This "wait" time that threads of execution must incur, coupled with the additional latency required for electricity to travel greater distances to access memory across the system interconnect (not physically local to a CPU, but rather on another CPU board or memory bank) can impact processor efficiency and overall system / application throughput dramatically.   (this slide and the next are from http://www.OpenSparc.net )
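
The cost of that "wait" time is easier to appreciate when expressed in CPU cycles.  A minimal sketch, assuming a hypothetical 100 ns memory latency (the function name and the clock speeds are illustrative only):

```python
def stall_cycles(memory_latency_ns, clock_ghz):
    """CPU cycles that elapse while one memory access is outstanding."""
    return memory_latency_ns * clock_ghz  # ns * (cycles per ns)

# Hypothetical 100 ns memory latency at three different clock speeds.
for ghz in (0.5, 1.4, 3.0):
    print(f"{ghz} GHz: {stall_cycles(100, ghz):.0f} cycles lost per access")
```

With the memory latency held constant, every increase in clock speed simply raises the number of cycles a thread sits idle on each miss.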

Chip Multi-Processing + Hardware Multi-Threading = Chip Multi-Threading

Realizing the implications of the memory latency "lag" (and somewhat against the tide of relying upon ever-increasing CPU clock speeds), several years ago Sun made the decision to address this with an acquisition of Afara Websystems to bolster its processor lineup and chart the industry's new course toward multi-core CPU's.


From the diagram on the left, it can be seen that Sun's adoption of both CMP (US-IV/+), aka Chip level Multi-Processing (having multiple cores per cpu, each with an execution pipeline), alongside HMT .. Hardware Multi-Threading (adding a multi-threaded execution pipeline within a core/cpu), shores up and nearly eliminates the issue of idle CPU cycles waiting on memory operations (by offering many physical threads of simultaneous execution within a CPU).  This is the foundation of what Sun calls CMT (Chip Multi-Threading), which is reflected in Sun's CPU roadmap for both the Niagara (T1 and T2 CPU's), Rock CPU, as well as the Sun/Fujitsu Olympus CPU (as a follow-on to the US-IV+).
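
A very simplified model shows how multiple HW threads per core can hide memory latency.  It assumes each thread alternates a fixed burst of compute cycles with a fixed memory stall, and that the core can switch to another ready thread at no cost.  Both are idealizations of mine, and the 20 / 140 cycle figures are hypothetical.

```python
def core_utilization(hw_threads, compute_cycles, stall_cycles):
    """Fraction of cycles a core stays busy when each thread alternates
    `compute_cycles` of work with `stall_cycles` of waiting on memory."""
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, hw_threads * per_thread)

# Hypothetical workload: 20 cycles of work, then 140 cycles waiting on memory.
for threads in (1, 2, 4, 8):
    print(threads, core_utilization(threads, 20, 140))
```

In this toy model a single thread keeps the core busy only a small fraction of the time, while enough interleaved threads can keep the pipeline fully occupied.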

Why so much $cache ?

In order to further minimize system latency and ensure peak performance, modern architectures include Memory Management and CPU memory access mechanisms "on-chip", such as the L1 / L2 cache and MMU located in Sun's CMT products.  Below is a block diagram of Sun's latest "T2" SPARC Core Architecture (based on the 64 bit SPARC V9 instruction set).

    Critical components for optimal kernel CPU / memory performance:
          • Level 1 (L1) Data Cache / Instruction Cache ..
          • Level 2 (L2) Secondary Cache, on-chip and shared for Sun CMT processors
          • Level 3 (L3) <only Sun's US-IV+ cpu offers L3 cache on-chip>
          • I-TLB (L1 / L2) Instruction -Translation Lookaside Buffers
          • D-TLB (L1 / L2) Data -Translation Lookaside Buffers
          • Buffer cache (Filesystem Kernel "page cache", taken from the "free list" of available RAM; this is a kernel structure)
Note:  The kernel memory page "size" is determined in part by the CPU chosen, since x86/x64 CPU's
                (Intel/AMD) offer 4K pagesizes, while the UltraSparc I through IV... offer 8K / 64K /
                512K / 4MB page sizes.   (* monitor with the pagesize, trapstat, cpustat, and pmap .. *)
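
Page size matters because it determines how much memory the TLB can map without a miss (the "TLB reach").  The sketch below assumes a hypothetical 64-entry TLB; real TLB sizes vary by CPU, so treat the entry count as a placeholder.

```python
KB = 1024
MB = 1024 * KB

def tlb_reach_bytes(entries, page_size_bytes):
    """Memory that can be mapped at once by a TLB with `entries` entries."""
    return entries * page_size_bytes

TLB_ENTRIES = 64  # hypothetical entry count, for illustration only

page_sizes = (("4K", 4 * KB), ("8K", 8 * KB), ("64K", 64 * KB),
              ("512K", 512 * KB), ("4MB", 4 * MB))
for label, size in page_sizes:
    reach_kb = tlb_reach_bytes(TLB_ENTRIES, size) // KB
    print(f"{label:>4} pages: {reach_kb} KB of TLB reach")
```

Larger pages stretch the same number of TLB entries across far more memory, which is why page size shows up in TLB miss analysis.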

Examining the key attributes of your system's Workload as Requirements  :

Once again, in order to select the appropriate architecture for a production deployment, the all-inclusive entity that we need to examine in its entirety is the "Application Environment", all of its subsystems, as well as individual / concurrent workload characteristics :
  • Single Threaded vs. Multi-Threaded Applications (OLTP, DSS, HPC, Web/AppSvr, .. ?)
  • Compute / HPC Intensity (MPI, horizontally scaled compute farm requirements, ..)
  • Network I/O (long-lived connections vs. short-lived; large vs. small packets, # inbound RX pkts/sec..)
  • Memory Intensive Workload (shared / distributed memory req's, etc)
  • Storage Workload (R/W %'ages, Cache cfg, # Controller Interrupts/sec, # Files opened, shared FS,etc.)
  • Integer vs. Floating Point Calculations (T1's are not well suited to FP workloads)
  • 32 vs. 64 bit needs/benefits (address space needs beyond 4GB RAM ?)
  • Do SLA/SLC reqmt's focus on Throughput, BandWidth (IO/Net/Mem), Availability, and/or Response Times ?


Choosing the right CPU for your Workload :

Sun SPARC based CMP / CMT CPU's :

  • US-IV+ :  Well suited for very large vertically scaled configurations where the 32MB of L3 cache makes a big difference, such as DB Servers (many of these environments have large single threaded processes, including batch/OLTP).  Sun's E25K systems scale up to 72 CPU's per single domain (144 cores).
  • Sun / Fujitsu SPARC64 (US-VI Olympus) :  Follow-on to the US-IV+ CPU.  (~1.5x performance of a US-IV+)  This CPU should have higher clockspeeds starting at 2.15GHz, with an additional HW thread/core, but no L3 cache on-chip.  Note that the up-coming CMT ROCK chip from Sun in 2008 will fall within this segment.
  • T1 (Niagara 1) :  Well suited for small to medium sized Multi-Threaded Workloads that don't have much / any FP processing (best for : WebSvrs, AppSvrs, DNS, etc..).  Each single socket system offers up to 32 HW Threads of execution.
  • T2 (Niagara 2) :  Well suited for medium to large Multi-Threaded Workloads.  These systems should also offer good general purpose computing performance, given the addition of FGU's per CPU core, along with built-in 10G Ethernet, etc..  Each 5x20 system presently offers a single socket with up to 64 HW Threads.  Look for multi-socket systems based upon the T2 cpu in the not so distant future ;)

Sun's "World Class" T2 (Niagara 2) CPU :

At a glance, the new T2 CPU offering from Sun is a true "system on a chip" that lives up to its reputation as "the world's fastest processor" (the new world record benchmarks listed further down in this article can attest to the validity of that statement and show how Sun's latest CMT CPU's are changing the landscape of computing efficiency, part of the reason why Sun calls this "CoolThreads" and/or "Throughput" computing).

T2 CPU Highlights :
    • 8 cores x 8 Threads each = 64 Threads of Execution
    • 65nm, initially running @ 1.4 GHz
    • 8 Floating Point / Graphics Units (FGU's, one per core)
    • on-chip Crossbar providing : 180 GB/s R + 90GB/s W
    • built in 2 x 10Gb Ethernet, MMU, Encryption, etc...

The following table provides a high level comparison of Sun's T2 and T1 CPU's  :

(for the complete Microprocessor Review report on the T2, click here)


(For further details comparing the T2 to other recent Sun CPU's for #transistors, etc., click here)
(For photos inside the new Sun T5x20 systems, based upon the T2 cpu, click here)

Solaris kernel (CPU related) Performance Metrics and Utilities :

The following table is only listed as a "high level" sample of common metrics frequently used as part of Solaris CPU-related performance analysis.  This is by no means a comprehensive list of metrics available, but rather an introduction for those that aren't familiar with the essentials.   An up-coming set of blogs will include much more detailed examples with command line output, also including discussions of kstat and Dtrace visibility available.

Note: *  vmstat, cpustat, trapstat, intrstat  reflect system-wide statistics, while  mpstat, cputrack reflect per CPU statistics. *

Metric | Description | Utility
Run Queue | Kernel Threads Runnable, but not executing (best if 0, or at most < # cores) | vmstat (r)
Blocked Kthr | Blocked Kernel Threads (typically ID's an IO bottleneck, see also %wt, lockstat,.. ) | vmstat (b)
System Calls | Number of System Calls (calls made into the OS kernel, accounting towards %Sys) | vmstat (sy)
Interrupts | Number of System Interrupts per interval (interrupts have the highest priority on the system) | vmstat (in)
% CPU (U/S/I) | % CPU utilization (% User space / % System kernel / % Idle);  % User should be 2x % Sys | vmstat
Cross Calls | Per CPU Cross-Calls (either for cross processor interrupts, and/or maintaining cpu virtual memory translation consistency .. aka cache consistency with MMU and mapping TLB entries, etc.) | mpstat (xcal)
Cpu Interrupts | Per CPU Interrupts (also use intrstat, as well as lockstat for system correlation) | mpstat (intr)
Context Switches | Involuntary context switching (icsw reflects preemption..) vs. voluntary context switching (csw) | mpstat (i/csw)
CPU Migrations | Per Cpu Migrations .. A more inclusive migration off of and onto another CPU. | mpstat (migr)
Shared Mutex | Mutex exclusion lock activity (per cpu);  p/lockstat gives the best visibility of this activity. | mpstat (smtx)
% CPU Waiting | % of a single CPU spent Waiting (during the sampling interval).  See also b kthr. | mpstat (%wt)
Instr TLB Misses | % of MMU related Instruction Translation Lookaside Buffer Misses (see also pagesize, pmap, cpustat..) | trapstat -t/T
Data TLB Misses | % of MMU related Data Translation Lookaside Buffer Misses (see note above as pgsize is related) | trapstat -t/T
CPU Counters | VARIOUS CPU specific HW event counters (Cache, Instruction level, FP, TLB; man cputrack for your HW specific counters available) | cputrack
CPU Counters | VARIOUS System Wide CPU event Counters (man cpustat for your HW specific counters) | cpustat
BUS Statistics | Available System Specific Bus Device / Instance Counters & Events (use busstat -l for your HW) | busstat
kernel Statistics | ALL kernel statistics are available individually via kstat (module:instance:name:class) | kstat
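
As a rough sketch of how a few of the vmstat rules of thumb above could be automated, the Python below scans captured vmstat output and flags suspicious samples.  Everything here is an assumption for illustration: the sample text is made up (not captured from a real system), the function name is mine, and the parsing assumes the usual Solaris layout where r and b are the first two columns and us / sy / id are the last three.

```python
# Made-up sample in the usual Solaris vmstat layout (illustration only).
SAMPLE_VMSTAT = """\
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 8123456 4012345 12 40 0 0 0 0 0 1 0 0 0 412 2310 655 12 5 83
 9 2 0 8100000 3990000 30 95 0 0 0 0 0 4 0 0 0 980 9120 2101 55 40 5
"""

def check_vmstat(text, cores):
    """Return (line_number, warning) tuples for samples that break the
    rules of thumb: run queue below # cores, no blocked kernel threads,
    and % User at least 2x % Sys."""
    warnings = []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        # Skip header lines (anything that is not entirely numeric).
        if not fields or not all(f.isdigit() for f in fields):
            continue
        run_queue, blocked = int(fields[0]), int(fields[1])
        user, system = int(fields[-3]), int(fields[-2])
        if run_queue >= cores:
            warnings.append((number, f"run queue {run_queue} >= {cores} cores"))
        if blocked > 0:
            warnings.append((number, f"{blocked} blocked kernel threads"))
        if system > 0 and user < 2 * system:
            warnings.append((number, f"% User {user} is less than 2x % Sys {system}"))
    return warnings

for line_number, message in check_vmstat(SAMPLE_VMSTAT, cores=8):
    print(f"line {line_number}: {message}")
```

These thresholds are only starting points; the right limits depend on the workload, and flagged samples should be correlated with mpstat, lockstat, and the other utilities listed above.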

** NOTE: if you'd like to try a single Solaris utility that can run in minutes to automate the performance / workload correlation and reporting for you, take a look at sys_diag if you haven't already done so (or the README).  It includes both high-level (vmstat, mpstat, iostat, netstat, kstat, ...) snapshot and analysis, as well as a Deep analysis mode which includes extended Dtrace / dexplorer and lockstat probing.   (all output is summarized and color coded in an HTML report header / Dashboard with a Table of Contents for analysis details) **


Common CPU Benchmarks and What they mean :

The following list provides a set of definitions and examples for some of today's most common independent (industry accepted) computing benchmarks.

  • SPEC CPU2006  :  CPU-intensive benchmark suite, stressing a system's processor, memory subsystem and compiler. SPEC designed CPU2006 to provide a comparative measure of compute-intensive performance across the widest practical range of hardware.  This benchmark suite includes both the SPEC int_rate2006 and SPEC fp_rate2006 benchmark tests.
NOTE that SPEC is an independent, non-profit 3rd party benchmarking organization, providing the comparative examples that follow.


Benchmark | Sun (name missing) | IBM System p570 | HP ProLiant DL360 G5
SPECint_rate2006 | 78.5 | 60.9 | 61.3
SPECfp_rate2006 | 62.3 | 58 | 38.8

NOTE:  For the benchmark example above, and those that follow, this shows how Sun's latest CMT T2 cpu is changing the landscape of computing efficiency, part of the reason that Sun calls this "CoolThreads" and/or "Throughput" computing.
  • SPEC jbb2005 :  SPECjbb2005 (Java Business Benchmark) measures the performance of a Java implemented application tier (server-side Java). The benchmark is based on the order processing of a wholesale supplier application. The metrics given are number of SPECjbb2005 bops (Business Operations per Second) and SPECjbb2005 bops/JVM (bops per JVM instance). 

Metric | Sun (name missing) | IBM p6 570 | HP 2660 | Dell 2950
Space (RU) | 1 | 4 | 2 | 2
Power Consumption (Watts) | 464 | 560 | 563 | 300
Performance (BOPS/JVM) | 170,153 | 87,737 | 80,884 | 74,218
Performance / Watt | 366.7 | 156.7 | 143.7 | 247.4
SWaP | 366.7 | 39.2 | 71.8 | 123.7
  • SPEC jAppServer2004  :  SPECjAppServer2004 is the only industry-standard benchmark used for Java Enterprise Edition application servers. In addition to testing application server performance, it also tests the database performance of servers deployed to support the application tier.
Database Tier, SPECjAppServer2004 2-Node Comparison Table

Metric | Sun (name missing) | HPrx2660 | Dell 2900 | IBMp5+550
Space (RU) | 1 | 2 | 5 | 4
Power Consumption (Watts) | 338 | 559 | 350 | 770
Performance (SPECjApp JOPS) | 2,000.92 | 874.17 | 652.95 | 1,197.51
Performance / Watt | 5.2 | 1.6 | 1.9 | 1.6
SWaP | 5.2 | 0.8 | 0.4 | 0.4

A word on SWaP

Given the state of environmental concerns (global warming), not to mention Power, Cooling, and floorspace costs, Sun has created the SWaP (Space, Watts and Performance) metric to compare and reflect the relative performance when taking into account the "space" (Rack Units), as well as "power consumption" (Watts).

The calculation is :      SWaP  =  Performance (operations or transactions per interval)  /  ( Space (RU)  x  Power Consumption (Watts) )
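
The SWaP arithmetic is simple enough to sketch in a few lines.  The function name is my own, and the sample figures are the space, power, and BOPS/JVM values from the SPECjbb2005 comparison earlier in this post (the systems are labeled A through D here rather than by name).

```python
def swap_metric(performance, rack_units, watts):
    """SWaP = Performance / (Space in RU x Power Consumption in Watts)."""
    return performance / (rack_units * watts)

# (label, BOPS/JVM, rack units, watts) from the SPECjbb2005 comparison.
systems = [
    ("System A", 170153, 1, 464),
    ("System B", 87737, 4, 560),
    ("System C", 80884, 2, 563),
    ("System D", 74218, 2, 300),
]

for label, bops, rack_units, watts in systems:
    print(f"{label}: SWaP = {swap_metric(bops, rack_units, watts):.1f}")
```

Note how a system that takes more rack space is penalized even when its performance per Watt is respectable.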

Sun Benchmarks and Comparative Methodology :

For the purpose of internal-only comparative benchmarking, Sun provides and maintains (internal-only) results for AMP v.2 and M-value benchmarks.

As stated very clearly at both sites (from the URL's noted above) :

The most appropriate way to size a Sun server for a specific application is to engage Sun's Competency Centers.  These centers will recommend a system using their knowledge about the application performance on Sun systems, and based on specific information from the customer.

Over the past 12 years with Sun, on several occasions I've been brought into a mission critical production environment having performance issues just after going live.   The most common causes include :

  • NOT doing any type of actual Pre-Production Testing on the "target" configuration to be deployed, including :
    • NOT doing any sort of formal POC (Proof of Concept) with the target configuration to be purchased or migrated to.  A proof of concept is typically not an all-out formal benchmark effort, but would minimally give you the opportunity to conduct Functional Testing, in addition to some simulated "production like" load tests against the configuration, that could demonstrate that the staged target environment will handle a representative sample of "production like" workload.
    • NOT doing formal benchmarking, using a copy of Production data, and simulating actual samples of the most active production workloads (DB queries, Client Access patterns, Network traffic, etc..) with a tool such as LoadRunner.
  • While the reason above is nearly always the case, the cause that frequently crops up in most of these scenarios is that the only "Sizing" done was to compare the "before" and "after" M-Values to generate quotes !   I have encountered this even in mission critical environments, where NO pre-production staging or load testing was done ! This is just WRONG, it's more than a bad practice, possibly one that could get you fired if your (or your customer's) production environment goes down in flames after lots of $$ was spent !  Push back if necessary !
Use proper process and methodology in actually staging a new configuration and perform representative "production like" load testing.

** Single-Core to Multi-Core (CMT) M-Value ("on paper") Comparisons Should CAUTIOUSLY be evaluated !! **


Regarding the trend of migrating production environments from single-core to multi-core architectures, Beware!  Even though a generic benchmark test might reflect much higher #'s with fewer HW threads (and/or cores) than a current production configuration has, realize that there is a lot more to proper sizing and capacity planning (realizing that each configuration is unique) than is reflected in M/GHz (or in M-values) ! A lot can be said for certain types of workload requiring a specific # of HW cores for their environment to perform optimally "on cpu" without much cpu/kernel contention (locking, High TLB misses, context switching, and/or cpu-migrations..).


Hopefully this article has helped you reflect on the wide variety of CPU options available, as well as how they play such a significant role as the "cornerstone" of system architecture and overall performance of our customers' production environments.  Enjoy, and "let the chips fall (or rise) as they may"... :)

For more information regarding Performance Analysis, Capacity Planning, and  related Tools, see Todd's Blog at : http://blogs.sun.com/toddjobson/category/Performance+and+Capacity+Planning

* Copyright 2007 Todd A. Jobson *

Sunday Sep 30, 2007

The Many Flavors of System Latency.. along the Critical Path of Peak Performance


From an article that I wrote last month, published in the September 2007 issue of Sun's Technocrat, this examination of System Latency starts where we left off with the last discussion What is Performance ? .. in the Real World .  That discussion identified the following list of key attributes and metrics that most in the IT world associate with optimal system performance :
  • Response Times (Client GUI's, Client/Server Transactions, Service Transactions, ..) Measured as "acceptable" Latency.
  • Throughput (how much Volume of data can be pushed through a specific subsystem.. IO, Network, etc...)
  • Transaction Rates (DataBase, Application Services, Infrastructure / OS / Network.. Services, etc.).  These can be either rates per Second, Hour, or even Day... measuring various service-related transactions.
  • Failure Rates (# or Frequency of exceeding High or Low Water Marks .. aka Threshold Exceptions)
  • Resource Utilization (CPU Kernel vs. User vs. Idle, Memory Consumption, etc..)
  • Startup Time (System HW, OS boot, Volume Mgmt Mirroring, Filesystem validation, Cluster Data Services, etc..)
  • FailOver / Recovery Time (HA clustered DataServices, Disaster Recovery of a Geographic Service, ..)  Time to recover a failed Service (includes recovery and/or startup time of restoring the failed Service)
  • etc ...

Each of the attributes and perceived gauges of performance listed above have their own intrinsic relationships and dependencies to specific subsystems and components... in turn reflecting a type of "latency" (delay in response). It is these latencies that are investigated and examined for root cause and correlation as the basis for most Performance Analysis activities.

How do you define Latency ?

In the past, the most commonly used terminology relating to latency within the field of Computer Science had been "Rotational Latency". This was due to the huge discrepancy between the responsiveness of an operation requiring mechanical movement, vs. the flow of electrons between components, where previously the discrepancy was astronomical (nano seconds vs. milliseconds).  Although the most common bottlenecks do typically relate to physical disk-based I/O latency, the paradigm of latency is shifting.  With today's built in HW caching controllers and memory resident DB's, (along with other optimizations at the HW, media, drivers, and protocols...), the gap has narrowed. Realize that in 1 nanosecond (1 billionth of a second), electricity can travel approximately one foot down a wire (approaching the speed of light). 
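
To make the mechanical vs. electrical gap concrete, here is a small sketch of both numbers.  The function names are mine.  The first function uses the speed of light in a vacuum as an upper bound (signals in real wire are somewhat slower), and the second uses the standard "half a revolution on average" rule for rotational latency.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
METERS_PER_FOOT = 0.3048

def light_distance_feet(nanoseconds):
    """Distance light travels in a vacuum in the given time (upper bound)."""
    return SPEED_OF_LIGHT_M_PER_S * nanoseconds * 1e-9 / METERS_PER_FOOT

def average_rotational_latency_ms(rpm):
    """Average wait for a sector to arrive under the head: half a revolution."""
    return 0.5 * 60_000.0 / rpm

print(f"1 ns at light speed: {light_distance_feet(1):.2f} feet")
for rpm in (7200, 10000, 15000):
    print(f"{rpm} RPM: {average_rotational_latency_ms(rpm):.2f} ms average rotational latency")
```

Even a fast 15,000 RPM drive averages a couple of milliseconds of rotational delay, which is still millions of nanoseconds.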

However, given the industry's latest cpu's running multiple cores at clock speeds upwards of multiple GigaHertz (with >= 1 thread per core, each theoretically executing more than 1 billion instructions per second...), many bottlenecks can now easily be realized within memory, where the densities have increased dramatically, the distances across huge supercomputer buses (and grids) have expanded dramatically, and most significantly.. the latency of memory has not decreased at the same rate as cpu speed increases. In order to best investigate system latency, we first need to define it and fully understand what we're dealing with.

LATENCY :

  • noun               The delay, or time that it takes prior to a function, operation, and/or transaction occurring.  (my own definition)
  • adj   (Latent)   Present or potential but not evident or active.
BOTTLENECK :
  • noun               A place or stage in a process at which progress is impeded.
THROUGHPUT :
  • noun              Output relative to input; the amount of data passing through a system from input to output.
BANDWIDTH :
  • noun              The amount of data that can be passed along a communications channel in a given period of time.

(definitions cited from www.dictionary.com)

 

The "Application Environment" and its basic subsystems :

 

Once again, the all-inclusive entity that we need to realize and examine in its entirety is the "Application Environment", and its standard subsystems :

  • OS / Kernel (System processing)
  • Processors / CPU's
  • Memory
  • Storage related I/O
  • Network related I/O
  • Application (User) SW

 

The "Critical Path" of (End-to-End) System Performance :

Although system performance might frequently be associated with one (or a few) system metrics, we must take 10 steps back and realize that overall system performance is one long inter-related sequence of events (both parallel and sequential). Depending on the type of workload and services running within an Application Environment, the Critical Path might vary, as each system has its own performance profile and related "personality". Using the typical OLTP RDBMS environment as an example, the Critical Path would include everything (and ALL Latencies incurred) between :

Client Node / User -> Client GUI -> Client Application / Services -> Client OS / Kernel -> Client HW -> NICs -> Client LAN -> (network / naming services, etc.. ) -> WAN (switches, routers, ...) -> ... Network Load Balancing Devices

-> Middleware / Tier(s) -> Web Server(s) -> Application Server(s) -> Directory, Naming, NFS... Servers/Services->

-> RDBMS Server(s) [Infrastructure Svcs, Application SW, OS / kernel, VM, FS / Cache, Device Drivers, System HW, HBA's, ...] -> External SAN /NAS I/O [ Switches, Zones/Paths, Array(s), Controllers, HW Cache, LUN(s), Disk Drives, .. ] -> RDBMS Svr ... LAN ...... -> ... and back to the Client Node through the WAN, etc... <<-

(NOTE: MANY sub-system components / interactions are left out in this example of a transaction and response between a client and DB Server)
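
One way to reason about a critical path like this is to add up the latency of each sequential stage, and to count only the slowest branch wherever stages run in parallel.  The sketch below does just that.  The function name, the stage breakdown, and every latency value are hypothetical, picked only to show the bookkeeping.

```python
def critical_path_latency_ms(stages):
    """Sum stage latencies; a stage given as a list/tuple is a set of
    parallel branches, and only its slowest branch counts."""
    total = 0.0
    for stage in stages:
        total += max(stage) if isinstance(stage, (list, tuple)) else stage
    return total

# Hypothetical per-stage latencies (ms) for one client transaction.
transaction = [
    0.5,         # client GUI / application / OS
    2.0,         # LAN / WAN / load balancer
    1.0,         # web server
    3.0,         # application server
    [4.0, 6.5],  # two DB queries issued in parallel
    2.0,         # response back to the client
]

print(critical_path_latency_ms(transaction), "ms end-to-end")
```

Shaving time off a branch that is not the slowest of its parallel group will not change the end-to-end number at all, which is why identifying the true critical path comes first.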

 

Categories of Latency :

Latency, in and of itself, simply refers to a delay of sorts.  In the realm of Performance Analysis and Workload Characterization, an association can generally be made between certain types of latency and a specific sub-system "bottleneck".  However, in many cases the underlying "root causes" of bottlenecks are the result of several overlapping conditions, none of which individually cause performance degradation, but together can result in a bottleneck. It is for this reason that performance analysis is typically an iterative exercise, where the removal of one bottleneck can easily result in the creation of another "hot spot" elsewhere, requiring further investigation and / or correlation once a bottleneck has been removed.

 

Internal vs. External Latency ...

Internal Forms of Latency :

  • CPU Saturation (100% Utilization, High Run Queues, Blocked Kthreads, Cpu Contention ... Migrations / Context Switching / ... SMTX, ..)
  • Memory Contention (100% Utilization, Allocation Latency due to either location, Translation, and/or paging/swapping, ...)
  • OS Kernel Contention Overhead ( aka .. "Thrashing" due to saturation.. )
  • IO Latency ( Hot Spots, High Svc Times, ...)
  • Network Latency
  • OS Infrastructure Service Latency (Telnet, FTP, Naming Svcs, ...)
  • Application SW / Services (Application Libraries, JVM, DB, ...)

External Forms of Latency :

  • SAN or External Storage Devices (Arrays, LUNS, Controllers, Disk Drives, Switches, NAS, ...)
  • LAN/WAN Device Latency (Switches, Routers, Collisions, Duplicate IP's, Media Errors, ....)
  • External Services (DNS, NIS, NFS, LDAP, SNMP, SMTP, DB, ...)
  • Protocol Latency (NACK's, .. Collisions, Errors, etc...)
  • Client Side Latency


Perceived vs. Actual Latency ...

Anyone who has worked in the field with end-users has likely experienced scenarios where users attribute a change in application behavior to a performance issue, in many cases incorrectly. The following is a short list of the top reasons for a gap between user perception and actual system performance :

  • Mis-Alignment of user expectations, vantage points, anticipation, etc.. (Responsiveness / Response Times, ...)
  • Deceptive expectations based upon marketing "PEAK" Throughput and/or CPU clock-speed #'s and promised increases in performance.  (high clock speeds do NOT always equate to higher throughput or better overall performance, especially if ANY bottlenecks are present)
  • PEAK Throughput #'s can only be achieved if there is NO bottleneck or related latency along the critical path as described above. The saturation of ANY sub-system will degrade the performance until that bottleneck is removed.

    The PEAK Performance of a system will be dictated by the performance of its most latent and/or contentious components (or sub-systems) along the critical path of system performance. (e.g. The PEAK bandwidth of a system is no greater than that of its slowest component along the path of a transaction and all its interactions.)
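The "no greater than its slowest component" rule amounts to taking a minimum over the path. A minimal sketch, assuming a hypothetical path of four hops: three of the bandwidth figures are nominal peaks quoted later in this post, while the LUN figure is an invented placeholder.

```python
# Nominal peak bandwidth (MB/s) of each hop a transaction crosses.
# The "disk array LUN" value is a hypothetical placeholder.
path_mbps = {
    "1 Gigabit Ethernet": 125,
    "PCI-X (64 bit @ 133 MHz)": 1066,
    "Ultra 320 SCSI": 320,
    "disk array LUN": 90,
}


def bottleneck(path):
    """Return (component, bandwidth) for the slowest hop on the path."""
    name = min(path, key=path.get)
    return name, path[name]


name, peak = bottleneck(path_mbps)
print(f"peak path bandwidth: {peak} MB/s, limited by {name}")

# Removing one bottleneck simply exposes the next one.
faster_storage = {k: v for k, v in path_mbps.items() if k != "disk array LUN"}
next_name, next_peak = bottleneck(faster_storage)
print(f"after fixing storage: {next_peak} MB/s, limited by {next_name}")
```

Note how fixing the storage hop does not deliver the PCI-X figure; the network link becomes the new limit.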

    As the holy grail of system performance (along with Capacity Planning.. and ROI) dictates, ... a system that allows for as close to 100% of CPU processing time as possible (vs. WAIT events that pause processing) is what every  IT Architect and System Administrator strives for.   This is where systems using CMT (multiple cores per cpu, each with multiple threads per core) shine, allowing for more processing to continue even when many threads are waiting on I/O.
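One rough way to see why multiple hardware threads per core help is the classic multiprogramming approximation: if each thread is stalled (on I/O or memory) for a fraction p of the time, and stalls are assumed to be independent, the core sits idle only when all n threads are stalled at once, i.e. with probability p^n. This model and the 75% stall figure are simplifying assumptions brought in for illustration; they are not a description of how any particular CMT processor schedules its threads.

```python
def core_utilization(wait_fraction, hw_threads):
    """Approximate fraction of time a core has at least one runnable thread.

    Assumes every hardware thread stalls independently for `wait_fraction`
    of the time (a simplification; real workloads are correlated).
    """
    if not 0.0 <= wait_fraction <= 1.0:
        raise ValueError("wait_fraction must be between 0 and 1")
    if hw_threads < 1:
        raise ValueError("hw_threads must be at least 1")
    return 1.0 - wait_fraction ** hw_threads


# Hypothetical workload whose threads are stalled 75% of the time.
for threads in (1, 2, 4):
    print(f"{threads} thread(s) per core -> "
          f"~{core_utilization(0.75, threads):.0%} busy")
```

Under these assumptions a single-threaded core is busy about a quarter of the time, while four threads per core keep it busy roughly two thirds of the time.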

     

     

    The Application Environment and its Sub-Systems ... where the bottlenecks can be found

     

    Within Computing, or more broadly, Information Technology, "latency" and its underlying causes can be tied to one or more specific "sub-systems". The following list reflects the first level of "sub-systems" that you will find for any Application Environment :

    | Subsystem / Components | Attributes and key Characteristics | Related Metrics, Measurements, and/or Interactions |
    |---|---|---|
    | System "Bus" / Backplane | Backplane / centerplane, I/O Bus, etc.. (many types of connectivity and media are possible, all with individual response times and bandwidth properties). | Busstat output, aggregated total throughput #'s (from kstat, etc..) |
    | CPU's | # Cores, # HW Threads per core, Clock speed / Frequency in Ghz (cycles per second), Operations (instructions) per Sec, Cache, DMA, etc.. | vmstat, trapstat, cpustat, cputrack, mpstat, ... (Run Queue, Blocked Kthreads, ITLB_Misses, % S/U/Idle Utilization, # lwp's, ...) |
    | Memory / Cache | Speed/Frequency of Bus, Bandwidth of Bus, Bus Latency, DMA Config, L1/L2/L3 Cache Locations/ Sizes, FS page cache, Physical Proximity of Cache and/or RAM, FS page caching, tmpfs, pagesizes, .. | vmstat, pmap, mdb, kstat, prstat, trapstat, ipcs, pagesize, swap, ... (Cache Misses, DTLB_misses, Page Scan Rate, heap/stack/kernel sizes,..) |
    | Controllers (NIC's, HBA's, ..) | NIC RX Interrupt Saturation, NIC Overflows, NIC / HBA Caching, HBA SW vs. HW RAID, Bus/Controller Bridges/Switches, DMP, MPxIO, ... | netstat, kstat (RX Pkts / Sec, Network Errors, ...) , iostat, vxstat.. (Response Times, Storage device Svc_times..), lockstat, intrstat, ... |
    | Disk Based Devices | Boot Devices, RAID LUN's, File Systems (types, block sizes, ...), Volumes, RAID configuration (stripes, mirrors, RAID Level, paths,...), physical fragmentation, Mpxio, etc.. | iostat, vxstat, kstat, dtrace, statspack, .. (%wait, Service Times, blocked kernel threads, ... FS/LUN Hot Spots) |
    | OS / Kernel | Process Scheduling, Virtual Memory Mgmt, HW Mgmt/Control, Interrupt handling, polling, system calls, ... | vmstat (utilization, interrupts, syscalls, %Sys / % Usr, ...), prstat, top, mpstat, ps, lockstat (for smtx, lock, spin.. contention), ... |
    | OS Infrastructure Services | FTP, Telnet, BIND/DNS, Naming Svcs, LDAP, Authentication/Authoriz., .. | prstat, ps, svcadm, .. various .. |
    | Application Services | DB Svr, Web Svr, Application Svr, ... | various... |

     

Note, if you want a single Solaris utility to do the heavy lifting, performance / workload correlation, and reporting for you, take a look at sys_diag if you haven't already done so (or the README).

 

Media/ Transport Bandwidth and related Latencies :

 

The following table demonstrates the wide range of typical operating frequencies and latencies PER Sub-System, Component, and/or Media Type :

| Component / Transport Media | Response Time / Frequency / Speed | Throughput / Bandwidth |
|---|---|---|
| CPU | > 1+ Giga Hertz (1+ billion cycles per second) \* (# cores \* HW Threads / core) | > 1 billion operations per second (huge theoretical #ops/s per system) |
| Memory (nanoseconds = billionths of a second) | DDR (PC-3200 @ 200MHz / 200MHz bus) ~5 ns | DDR-400 Peak Transfer 3.2 GB/s |
| | DDR2 (PC2-5300 @ 166MHz / 333MHz bus) ~6 ns | DDR2-667 Peak Transfer 5.3 GB/s |
| | DDR2 (PC2-8500 @ 266MHz / 533MHz bus) ~3.75 ns <TBD> | DDR2-1066 Peak Transfer 8.5 GB/s <TBD> |
| Disk Devices | Service Times : ~5+ ms = ~X ms Latency + Y ms Seek Times (1 millisecond = 1000th of a second) [platter size, # cylinders / platters, RPM, ...] | varies greatly, see below |
| Ultra 320 SCSI (16 bit) parallel | (high performance, cable & dev limitations..) | Up to 320 MBps |
| SAS [Serial Attached SCSI] | Current | > 300 MBps (> 3 Gbps) |
| | Future <TBD> | Up to 1200 MBps <TBD> |
| SATA [Serial ATA] | low cost, higher capacity (poor performance) | Up to 300 MBps |
| | Future <TBD> | Up to 600 MBps <TBD> |
| USB 2.0 | 10-200+ microseconds (1 microsecond [us] = 1 millionth of a second) | Up to 480 Mbps (60 MBps), ~40 MBps Real-World Usable |
| FireWire (IEEE 1394) | | Up to 50 MBps |
| Fiber Channel (Dual Ch) | 4 Gb (4 / 2 / 1 Gb) \*2 | Up to 1.6 GBps (1 GB Usable) |
| | 8 Gb (8 / 4 / 2 Gb) \*2 <TBD> | Up to 3.2 GBps (1.8 GB Usable) |
| 1 Gigabit Ethernet | \*\* Latency ~ 50 us [microseconds] \*\* | 125 MBps (~1 Gbps) theoretical |
| 10 Gigabit Ethernet | | Up to 20 Gbps (<= 9 Gbps Usable) |
| Infiniband (Dual Ported HCA) | x4 (SDR / DDR) Dual Ported = \*2, \*\* Latency < 2 microseconds \*\* | 2\*10Gb = 20 Gbps (16 Gbps Usable) |
| | x8 (DDR) \*2 <TBD> | Up to 40 Gbps (32 Gbps Usable) |
| PCI 2.2 | 32 bit @ 33 MHz | 133 MBps |
| | 64 bit @ 33 MHz | 266 MBps |
| | 64 bit @ 66 MHz | 533 MBps |
| PCI-X | 64 bit bus width @ 100 MHz (parallel bus) | Up to 800 MB/s |
| | 64 bit bus width @ 133 MHz (parallel bus) | 1066 MBps (1 GBps) |
| PCI-Express (10's - 100's of nanoseconds for latencies) | v.1 serial bus / bi-directional @ 2.5 GHz | 4 GBps (x16 lanes) one direction, 8 GBps (x32 lanes) one direction, Up to 16 GBps bi-directional (x32) |
| | v.2 @ 5 GHz <TBD> | 32 GBps bi-directional (x32 lanes) |
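Several of the peak figures in the table above fall out of simple arithmetic: bus width times clock rate for parallel buses, and a divide-by-eight when moving from bits to bytes. The sketch below reproduces a few of them. The helper names are my own, and the 15K / 7,200 RPM drive speeds are assumed examples that do not appear in the table.

```python
def parallel_bus_mbytes_per_s(width_bits, clock_mhz, transfers_per_clock=1):
    """Peak MB/s of a parallel bus: bytes per transfer x transfers per second."""
    return (width_bits / 8) * clock_mhz * transfers_per_clock


def mbits_to_mbytes(mbits_per_s):
    """Convert megabits/s (Mbps) to megabytes/s (MBps)."""
    return mbits_per_s / 8


def avg_rotational_latency_ms(rpm):
    """Average rotational delay of a disk: half a revolution, in ms."""
    return 0.5 * 60_000 / rpm


# PCI 2.2, 32 bit @ 33 MHz -> roughly the 133 MBps in the table
print(f"PCI 32/33     : {parallel_bus_mbytes_per_s(32, 33.33):.0f} MBps")
# PCI-X, 64 bit @ 133 MHz -> roughly 1066 MBps
print(f"PCI-X 64/133  : {parallel_bus_mbytes_per_s(64, 133.33):.0f} MBps")
# DDR-400: 64-bit module, 200 MHz clock, two transfers per clock
print(f"DDR-400       : {parallel_bus_mbytes_per_s(64, 200, 2):.0f} MBps")
# Serial links are quoted in bits per second
print(f"USB 2.0       : {mbits_to_mbytes(480):.0f} MBps")
print(f"1 Gb Ethernet : {mbits_to_mbytes(1000):.0f} MBps")
# The rotational part of disk service time (RPM values are assumptions)
for rpm in (15_000, 7_200):
    print(f"{rpm} RPM disk: ~{avg_rotational_latency_ms(rpm):.2f} ms avg rotational latency")
```

These are theoretical peaks only. As the "Usable" figures in the table suggest, protocol overhead and contention keep real-world throughput well below them.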

 

Other Considerations Regarding System Latency :

Other considerations regarding system latency that are often overlooked include the following, which offer us a more holistic vantage point of system performance and of items that might work against "Peak" system capabilities :

  • For Application SW that supports advanced capabilities such as Infiniband RDMA (Remote Direct Memory Access), interconnect latencies can be virtually eliminated via Application RDMA "kernel bypass".  This would be applicable in an HPC grid and/or possibly  Oracle RAC Deployments, etc. (confirming certifications of SW/HW..).
  • Level of Multi-Threading vs. Monolithic serial or "batch" jobs (If Applications are not Multi-Threaded, then SMP and/or CMT systems with multiple processors / cores will likely always remain under-utilized).
  • Architectural configurations supporting load distribution across multiple devices / paths (cpu's, cores, NIC's, HBA's, Switches, LUNs, Drives, ...)
  • System Over Utilization (too much running on one system.. due to under-sizing or over-growth, resulting in system "Thrashing" overhead)
  • External Latency Due to Network and/or SAN I/O Contention
  • Saturated Sub-Systems / Devices (NIC's, HBA's, Ports, Switches, ...) create system overhead handling the contention.
  • Excessive Interrupt Handling (vs. Polling, Msg passing, etc..), resulting in overhead where Interrupt Handling can cause CPU migrations / context switching (interrupts have the HIGHEST priority within the Solaris Kernel, and are handled even before RT processing, preempting running threads if necessary).   Note, this can easily occur with NIC cards/ports that become saturated (> ~25K RX pkts/sec), especially for older drivers and/or over-utilized systems.
  • Java Garbage collection Overhead (sub-par programming practices, or more frequently OLD JVM's, and/or missing compilation optimizations).
  • Use of Binaries that are compiled generically using GCC, vs. HW optimized compilations using Sun's Studio Compilers (Sun Studio 12 can give you 200% + better performance than gcc binaries).
  • Virtualization Overhead (significant overhead relating to traps and library calls... when using VMware, etc..)
  • System Monitoring Overhead (the cumulative impact of monitoring utilities, tools, system accounting, ... as well as the IO incurred to store that historical performance trending data).
  • OS and/or SW ... Patches, Bugs, Upgrades (newly applied, or possibly missing)
  • Systems that are MIS-tuned are accidents waiting to happen.  Only tune kernel/drivers if you KNOW what you are doing, or have been instructed by support to do so (and have FIRST tested on a NON-production system).  I can't tell you how many performance issues I have encountered that were due to administrator "tweaks" to kernel tunables (to the point of taking down entire LAN segments !).  The defaults are generally the BEST starting point unless a world-class benchmarking effort is under-way.
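The multi-threading point in the list above can be quantified with Amdahl's law, which is a standard result I am bringing in here rather than something derived in this post: if only a fraction of a job can run in parallel, adding hardware threads speeds up only that fraction. A minimal sketch, using 32 hardware threads to match the T1 pictured at the top of the article; the parallel fractions are arbitrary examples.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Theoretical speedup from Amdahl's law.

    parallel_fraction: share of the run time that can be spread over workers.
    workers: number of hardware threads / cores available.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Arbitrary example fractions, evaluated for a 32-thread system.
for fraction in (0.0, 0.5, 0.95):
    speedup = amdahl_speedup(fraction, 32)
    print(f"{fraction:.0%} parallel -> {speedup:.2f}x speedup on 32 threads")
```

A purely serial batch job gains nothing from the extra threads, and even a job that is half parallel cannot quite double its speed, which is why such systems look under-utilized when running monolithic workloads.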

 

The "Iterative" nature of Performance Analysis and System Tuning

No matter what the root causes are found to be, in the realm of Performance Analysis and system Tuning, ... once you remove one bottleneck, the system processing characteristics will change, resulting in a new performance profile, and new "hot spots" that require further data collection and analysis. The process is iterative, and requires a methodical approach to remediation.

Make certain that ONLY ONE (1) change is made at a time, otherwise the effects ( + or - ) cannot be quantified.
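Quantifying the effect of a single change can be as simple as comparing the same metric before and after, under the same workload. A minimal sketch follows; the function name and all response-time samples are hypothetical, and a real comparison would want many more samples than shown here.

```python
from statistics import mean


def percent_change(baseline, candidate):
    """Percent change in the mean of a metric; negative means it went down."""
    base = mean(baseline)
    return (mean(candidate) - base) / base * 100.0


# Hypothetical response times (ms) sampled under the same workload,
# before and after ONE configuration change.
baseline_ms = [42.0, 40.0, 44.0, 42.0]
after_change_ms = [36.0, 35.0, 37.0, 36.0]

delta = percent_change(baseline_ms, after_change_ms)
print(f"mean response time changed by {delta:+.1f}%")
```

If two changes had been applied between the two sample sets, this one number could not tell you which of them helped, or whether one helped while the other hurt.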

Hopefully at some point in the future we'll be operating at latencies measured in attoseconds (10^-18, or 1 quintillionth of a second), but until then .... Happy tuning :)

For more information regarding Performance Analysis, Capacity Planning, and related Tools, review some of my other postings at :  http://blogs.sun.com/toddjobson/category/Performance+and+Capacity+Planning

 

Copyright 2007  Todd A. Jobson

About

This blog does not reflect the viewpoint or opinions of Oracle or Sun Microsystems. All comments are personal reflections and responsibility of Todd A. Jobson, and are copyrighted from the posted year to current year, to that effect.
