T4 Performance Counters explained

T4 has been out for a few months now, and some people may have wondered which details of the new pipeline can be monitored. "cpustat -h" lists a lot of events, and only very few of them are self-explanatory. I will try to give some insight into all of them. Some of these "PIC events" require in-depth knowledge of the T4 pipeline; I will try to explain these over time, and for the time being they should simply be ignored.

Notes:

  • Some counters changed from tape-out 1.1 (*only* used in the T4 beta program) to tape-out 1.2 (used in the systems shipping today). The table lists only the tape-out 1.2 counters.
  • I marked some counters "no good use known". That means no sensible application area is known (yet?). Some counters become useless while the overall design is being developed.
  • Check back often; I will edit this page every once in a while. The use cases for these counters may change over time as we learn.

pic name (cpustat)

Prose Comment

Sel-pipe-drain-cycles,
Sel-0-[wait|ready],
Sel-[1|2]

Sel-0-wait counts the cycles a strand waits to be selected. Some of the reasons can be counted in detail:

  • Sel-0-ready: cycles a strand was ready but not selected, which can signal pipeline oversubscription
  • Sel-pipe-drain-cycles: number of cycles a thread waits, after a branch mispredict and with a correct-path instruction at select, for the mispredict to resolve

Sel-1 and Sel-2 count how often a strand selected only one or only two µops

Pick-any, Pick-[0|1|2|3]

Cycles in which at least one (Pick-any), no (Pick-0), one, two, or three instructions or µops are picked

Instr_FGU_crypto, Instr_ld,
  Instr_st, SPR_ring_ops,
  Instr_other

Number of instructions executed on a strand, distinguished by type.

Instr_all

Total number of instructions executed on that strand

Sw_count_intr

Number of S/W count instructions on that vcpu (sethi %hi(fc000),%g0 (whatever that is))

Atomics

Number of atomic operations, which are LDSTUB/LDSTUBA, CASA/CASXA, and SWAP/SWAPA

SW_prefetch

Number of PREFETCH or PREFETCHA instructions

Block_ld_st

Block loads or stores on that vcpu

IC_miss_nospec,
IC_miss_[L2_or_L3|local|remote]\
_hit_nospec

Various I$ misses, distinguished by where they hit. All of these are counted per thread, but only as primary events: T4 counts only the first occurrence of an I$ miss on a core for a given instruction. If one strand misses in the I$, that miss is counted; if a second strand on the same core misses while the first miss is still being resolved, the second miss is not counted.
This flavour of I$ miss counters counts only misses caused by instructions that really commit (note the "_nospec").

BTC_miss

Branch target cache miss

ITLB_miss

ITLB misses (synchronously counted)

ITLB_miss_asynch

Same as ITLB_miss, but counted asynchronously

[I|D]TLB_fill_\
[8KB|64KB|4MB|256MB|2GB|trap]

H/W tablewalk events that fill ITLB or DTLB with translation for the corresponding page size. The “_trap” event occurs if the HWTW was not able to fill the corresponding TLB

IC_mtag_miss,
IC_mtag_miss_\
[ptag_hit|ptag_miss|\
ptag_hit_way_mismatch]

I$ micro tag misses, with some options for drill down

Fetch-0, Fetch-0-all

Fetch-0 counts the number of cycles nothing was fetched for this particular strand; Fetch-0-all counts the cycles nothing was fetched for any of the strands on a core

Instr_buffer_full

Cycles the instruction buffer for a strand was full, thereby preventing any fetch

BTC_targ_incorrect

Counts all occurrences of wrongly predicted branch targets from the BTC

[PQ|ROB|LB|ROB_LB|SB|\
ROB_SB|LB_SB|RB_LB_SB|\
DTLB_miss]\
_tag_wait

These counters monitor pipeline behaviour, so they are not strand-specific:

  • PQ_...: cycles Rename stage waits for a Pick Queue tag (might signal memory bound workload for single thread mode)
  • ROB_...: cycles Select stage waits for a ROB (ReOrderBuffer) tag
  • LB_...: cycles Select stage waits for a Load Buffer tag
  • SB_...: cycles Select stage waits for Store Buffer tag
  • Combinations of the above are allowed. Although some of these events can overlap, the counter is incremented only once per cycle if any of them occurs
  • DTLB_...: cycles load or store instructions wait at Pick stage for a DTLB miss tag
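
To make the once-per-cycle rule for the combined events concrete, here is a small Python sketch. The per-cycle trace and the helper name count_cycles are invented for illustration only; nothing here is taken from T4 documentation. It shows that a combined counter such as ROB_LB counts the cycles in which at least one of the conditions holds, so it is generally smaller than the sum of the individual counters.

```python
# Hypothetical per-cycle trace: the tags the Select stage waited for in each cycle.
# The data is made up purely to illustrate the counting rule.
trace = [
    {"ROB"},          # cycle 0: waiting for a ROB tag only
    {"ROB", "LB"},    # cycle 1: both conditions overlap
    set(),            # cycle 2: no tag wait
    {"LB"},           # cycle 3: waiting for a Load Buffer tag only
    {"ROB", "LB"},    # cycle 4: overlap again
]

def count_cycles(trace, events):
    """Count the cycles in which at least one of `events` is active."""
    wanted = set(events)
    return sum(1 for active in trace if active & wanted)

rob = count_cycles(trace, ["ROB"])           # cycles 0, 1, 4
lb = count_cycles(trace, ["LB"])             # cycles 1, 3, 4
rob_lb = count_cycles(trace, ["ROB", "LB"])  # cycles 0, 1, 3, 4

print(rob, lb, rob_lb)  # prints: 3 3 4
```

In this made-up trace the combined count is 4, not 3 + 3, because the two overlapping cycles are counted once each.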

[ID]TLB_HWTW_\
[L2_hit|L3_hit|L3_miss|all]

Counters for HWTW accesses caused by either DTLB or ITLB misses. Can be further detailed by where they hit

IC_miss_L2_L3_hit,
IC_miss_local_remote_remL3_hit,
IC_miss

I$ prefetches that were dropped because they either miss in L2$ or L3$
This variant counts misses regardless of whether the causing instruction commits or not

DC_miss_nospec, DC_miss_[L2_L3|local|remote_L3]\
_hit_nospec

D$ misses either in general or detailed by where they hit
See the IC_miss entries, which come in the same two flavours, for an explanation of _nospec and the reasoning behind the two sets of DC_miss counters

DTLB_miss_asynch

Counts all DTLB misses asynchronously; there is no way to count them synchronously

DC_pref_drop_DC_hit, SW_pref_drop_[DC_hit|buffer_full]

DC_pref_drop_DC_hit counts L1 D$ hardware prefetches that were dropped because of a D$ hit, counted per core. The SW_pref_drop_* events count dropped software prefetches per strand

[Full|Partial]_RAW_hit_st_[buf|q]

Count events where a load wants data that has not yet been stored, i.e. data that is still inside the pipeline. The data might still be in either the store buffer or the store queue. If the load's data matches in both the store buffer and the store queue, the data in the buffer takes precedence, of course, since it is younger

[IC|DC]_evict_invalid,
[IC|DC|L1]_snoop_invalid,
[IC|DC|L1]_invalid_all

Counters for cache lines invalidated by evictions or by snoops, or for all invalidations, counted per core

St_q_tag_wait

Number of cycles pipeline waits for a store queue tag, of course counted per core

Data_pref_[drop_L2|drop_L3|\
hit_L2|hit_L3|\
hit_local|hit_remote]

Data prefetches that can be further detailed by either why they were dropped or where they did hit

St_hit_[L2|L3],
St_L2_[local|remote]_C2C,
St_local, St_remote

Store events distinguished by where they hit or where they cause an L2 cache-to-cache transfer, i.e. either a transfer from another L2$ on the same die or from a different die

DC_miss, DC_miss_\
[L2_L3|local|remote]_hit

D$ misses either in general or detailed by where they hit
See the IC_miss entries, which come in the same two flavours, for an explanation of _nospec and the reasoning behind the two sets of DC_miss counters

L2_[clean|dirty]_evict

Per-core clean or dirty L2$ evictions. L2_clean_evict can signal an instruction fetcher bottleneck: instructions can be considered clean because they don't change. If a high number of clean evictions occurs, one might be evicting instructions from L2$ that have not been fetched fast enough

L2_fill_buf_full,
L2_wb_buf_full,
L2_miss_buf_full

Per core L2$ buffer events, all count number of cycles that this state was present

no good use known

L2_pipe_stall

Per core cycles pipeline stalled because of L2$

no good use known

Branches

Count branches (Tcc, DONE, RETRY, and SIT are not counted as branches)

Br_taken

Counts taken branches (Tcc, DONE, RETRY, and SIT are not counted as branches)

Br_mispred,
Br_dir_mispred,
Br_trg_mispred,
Br_trg_mispred_\
[far_tbl|indir_tbl|ret_stk]

Counter for various branch misprediction events. 

Cycles_user

Counts cycles; the attribute settings hpriv, nouser, and sys control which address space is counted

Commit-[0|1|2],
Commit-0-all,
Commit-1-or-2

Number of times either no, one, or two µops commit for a strand. Commit-0-all counts number of times no µop commits for the whole core
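
As a rough illustration of how some of the counters in the table can be combined, the following Python sketch derives a few ratios from raw counts. The sample values are invented, the helper name ratio is hypothetical, and the choice of ratios is only a suggestion. It assumes all counters were collected for the same strand, over the same interval, and with the same hpriv/nouser/sys attribute settings.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero."""
    return numerator / denominator if denominator else None

# Invented sample values for one strand over one sampling interval.
sample = {
    "Instr_all": 800_000,
    "Cycles_user": 2_000_000,
    "Branches": 100_000,
    "Br_taken": 60_000,
    "Br_mispred": 5_000,
}

# Instructions per cycle for this strand.
ipc = ratio(sample["Instr_all"], sample["Cycles_user"])
# Share of counted branches that were taken.
taken_share = ratio(sample["Br_taken"], sample["Branches"])
# Share of counted branches that were mispredicted.
mispredict_rate = ratio(sample["Br_mispred"], sample["Branches"])

print(f"IPC: {ipc:.2f}")
print(f"taken branches: {taken_share:.1%}")
print(f"mispredicted branches: {mispredict_rate:.1%}")
```

Keep in mind that Tcc, DONE, RETRY, and SIT are not counted as branches, so these ratios only cover what the Branches counter itself sees.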

The various types of instructions that are monitored by the Instr_* counters are:

  • FGU operations are floating point and crypto operations 
  • Load operations: LDD, LDDA_0X14, LDDA_0X15, LDDA_0X1C, LDDA_0X1D, LDDA_0X22, LDDA_0X23, LDDA_0X26, LDDA_0X27, LDDA_0X2A, LDDA_0X2B, LDDA_0X2E, LDDA_0X2F, LDDA_0X80, LDDA_0X82, LDDA_0X88, LDDA_0X8B, LDDA_0XE2, LDDA_0XE3, LDDA_0XEA, LDDA_0XEB, LDDF, LDDFA_0X16, LDDFA_0X1E, LDDFA_0X4, LDDFA_0X80, LDDFA_0X88, LDDFA_0X89, LDDFA_0X8A, LDDFA_0XC, LDDFA_0XD0, LDDFA_0XDA, LDDFA_0XF0, LDDFA_0XF8, LDF, LDFA_0X15, LDFA_0X1D, LDFA_0X4, LDFA_0X80, LDFA_0X81, LDFA_0X82, LDFA_0X83, LDFA_0X88, LDFA_0X89, LDFA_0X8B, LDFA_0XC, LDFSR, LDSB, LDSBA_0X14, LDSBA_0X15, LDSBA_0X1C, LDSBA_0X1D, LDSBA_0X31, LDSBA_0X4, LDSBA_0X80, LDSBA_0X81, LDSBA_0X88, LDSBA_0X8B, LDSBA_0XC, LDSH, LDSHA_0X14, LDSHA_0X1C, LDSHA_0X1D, LDSHA_0X36, LDSHA_0X38, LDSHA_0X39, LDSHA_0X3E, LDSHA_0X80, LDSHA_0X82, LDSHA_0X83, LDSHA_0X88, LDSHA_0X8A, LDSHA_0X8B, LDSW, LDSWA_0X14, LDSWA_0X15, LDSWA_0X1C, LDSWA_0X1D, LDSWA_0X36, LDSWA_0X80, LDSWA_0X83, LDSWA_0X88, LDSWA_0X89, LDSWA_0X8B, LDUBA_0X14, LDUBA_0X15, LDUBA_0X1C, LDUBA_0X1D, LDUBA_0X30, LDUBA_0X38, LDUBA_0X4, LDUBA_0X80, LDUBA_0X88, LDUBA_0X8B, LDUH, LDUHA_0X14, LDUHA_0X15, LDUHA_0X1C, LDUHA_0X1D, LDUHA_0X39, LDUHA_0X3E, LDUHA_0X4, LDUHA_0X80, LDUHA_0X83, LDUHA_0X88, LDUHA_0XC, LDUW, LDUWA_0X14, LDUWA_0X15, LDUWA_0X1C, LDUWA_0X1D, LDUWA_0X31, LDUWA_0X39, LDUWA_0X3E, LDUWA_0X4, LDUWA_0X80, LDUWA_0X81, LDUWA_0X82, LDUWA_0X83, LDUWA_0X88, LDUWA_0X89, LDUWA_0X8A, LDUWA_0X8B, LDUWA_0XC, LDX, LDXA_0X14, LDXA_0X15, LDXA_0X1C, LDXA_0X1D, LDXA_0X4, LDXA_0X41, LDXA_0X63, LDXA_0X80, LDXA_0X81, LDXA_0X82, LDXA_0X83, LDXA_0X88, LDXA_0X89, LDXA_0X8B, LDXA_0XC, RDASI, RDASR, RDCCR, RDFPRS, RDGSR, RDHPR, RDPR, RDTICK, RDY
  • Store operations: FLUSH, MEMBAR except MEMBAR #Lookaside and MEMBAR #LoadStore, STBA_0X14, STBA_0X15, STBA_0X1C, STBA_0X1D, STBA_0X30, STBA_0X31, STBA_0X36, STBA_0X38, STBA_0X39, STBA_0X3E, STBA_0X4, STBA_0X80, STBA_0X81, STBA_0X88, STBA_0X89, STBA_0XC, STBAR, STD, STDA_0X14, STDA_0X15, STDA_0X1C, STDA_0X1D, STDA_0X27, STDA_0X2F, STDA_0X4, STDA_0X80, STDA_0X88, STDA_0XEA, STDA_0XEB, STDF, STDFA_0X17, STDFA_0X4, STDFA_0X80, STDFA_0X88, STDFA_0X89, STDFA_0XC0, STDFA_0XC2, STDFA_0XC4, STDFA_0XC8, STDFA_0XCA, STDFA_0XCB, STDFA_0XCC, STDFA_0XD1, STDFA_0XD3, STDFA_0XD9, STDFA_0XE0, STDFA_0XF0, STDFA_0XF8, STDFA_0XF9, STF, STFA_0X14, STFA_0X15, STFA_0X1C, STFA_0X4, STFA_0X80, STFA_0X81, STFA_0X88, STFA_0X89, STFA_0XC, STFSR, STHA_0X14, STHA_0X15, STHA_0X1C, STHA_0X1D, STHA_0X30, STHA_0X31, STHA_0X4, STHA_0X80, STHA_0X88, STHA_0XC, STW, STWA_0X14, STWA_0X15, STWA_0X1C, STWA_0X1D, STWA_0X38, STWA_0X4, STWA_0X80, STWA_0X81, STWA_0X88, STWA_0XC, STX, STXA_0X14, STXA_0X15, STXA_0X1C, STXA_0X1D, STXA_0X22, STXA_0X23, STXA_0X26, STXA_0X27, STXA_0X2A, STXA_0X2B, STXA_0X2E, STXA_0X2F, STXA_0X41, STXA_0X73, STXA_0X80, STXA_0X81, STXA_0X88, STXA_0XC, STXA_0XE2, STXA_0XE3, STXA_0XEA, STXA_0XEB, STXA_0XF2, STXA_0XF3, STXA_0XFA, STXA_0XFB
  • SPR ring operations: LDXA_0X20, LDXA_0X21, LDXA_0X25, LDXA_0X45, LDXA_0X48, LDXA_0X49, LDXA_0X4C, LDXA_0X4E, LDXA_0X4F, LDXA_0X52, LDXA_0X54, LDXA_0X58, LDXA_0X64, LDXA_0X74, LDXA_0XB0, STXA_0X20, STXA_0X21, STXA_0X25, STXA_0X42, STXA_0X45, STXA_0X4C, STXA_0X4E, STXA_0X4F, STXA_0X50, STXA_0X52, STXA_0X54, STXA_0X57, STXA_0X58, STXA_0X5C, STXA_0X5F, STXA_0X64, STXA_0X72, STXA_0XB0, WRASI, WRASR, WRCCR, WRFPRS, WRGSR, WRHPR, WRPAUSE, WRPR, WRY, STICK_ENABLE
  • Other instructions: ADD, ADDCcc, ADDXCcc, ALLCLEAN, AND, ANDcc, ANDN, ANDNcc, CASA, CASXA, DONE, FLUSHW, HALT, INVALW, LDSTUB, LDSTUBA, MEMBAR #Lookaside, MEMBAR #LoadStore, MOVdTOX, MOVFN, MOVR, MOVxTOd, NOP, NORMALW, OR, ORcc, ORNcc, OTHERW, PREFETCH, PREFETCHA, RDCFR, RDPC, RESTORED, RETRY, SAVE, SAVED, SETHI_%G0, SIAM, SLLX, SRAX, SRLX, SUB, SUBcc, SUBCcc, SWAP, SWAPA, TADDcc, TADDccTV, TN, TSUBcc, TSUBccTV, WRCFR, XNOR, XNORcc, XOR, XORcc.
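
A simple way to use the Instr_* counters is to look at the instruction mix of a strand. The Python sketch below computes each type's share of Instr_all. The sample counts are invented and the function name instruction_mix is hypothetical. It assumes that all counters were collected over the same interval; whether the per-type counters add up exactly to Instr_all on real hardware is not something this sketch relies on.

```python
# Per-type instruction counters from the table above.
INSTR_TYPES = [
    "Instr_FGU_crypto",
    "Instr_ld",
    "Instr_st",
    "SPR_ring_ops",
    "Instr_other",
]

def instruction_mix(counts):
    """Return each instruction type's share of Instr_all (empty dict if Instr_all is 0)."""
    total = counts["Instr_all"]
    if total == 0:
        return {}
    return {name: counts[name] / total for name in INSTR_TYPES}

# Invented sample counts for one strand.
counts = {
    "Instr_all": 1000,
    "Instr_FGU_crypto": 100,
    "Instr_ld": 250,
    "Instr_st": 150,
    "SPR_ring_ops": 0,
    "Instr_other": 500,
}

mix = instruction_mix(counts)
for name, share in mix.items():
    print(f"{name:18s} {share:6.1%}")
```

A strand with a high Instr_ld and Instr_st share is a natural candidate for a closer look at the D$ and DTLB counters.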

About

Before Sun was acquired by Oracle I was about 12 yrs in pre-sales covering SPARC and Solaris. Today I work in a field role in Oracle Microelectronics and focus on SPARC performance, including working and presenting at customer sites all over EMEA
