UltraSPARC T1 large page projects in Solaris

A Translation Lookaside Buffer (TLB) is a hardware cache that is used
to translate a process's virtual addresses to physical memory addresses.
UltraSPARC T1 has a 64-entry Instruction TLB (ITLB) and a 64-entry Data TLB
(DTLB) per core. The unit of translation is the page; UltraSPARC T1 supports
four page sizes: 8KB (the default), 64KB, 4MB and 256MB. When memory is
accessed and its mapping is not in the TLB, this is termed a TLB miss.
Excessive TLB misses are bad for performance.

The 64-entry TLBs are relatively small compared to those of current SPARC processors.
They have the advantage, however, that you can mix and match all page
sizes in the TLB, i.e. the TLB does not have to be programmed to a particular page size.
One entry can be 8KB, for instance, and the next one can be 256MB.
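To put the TLB size in perspective, the amount of memory 64 entries can map (the TLB "reach") depends entirely on the page size each entry uses. The minimal sketch below just multiplies the entry count from the text by each supported page size; it assumes every entry uses the same size and ignores that the TLB is shared by all strands of a core, so treat it as a back-of-the-envelope model only.

```python
# TLB reach = number of entries x bytes mapped per entry.
TLB_ENTRIES = 64  # ITLB or DTLB entries per core, from the text

KB = 1024
MB = 1024 * KB
GB = 1024 * MB

# The four UltraSPARC T1 page sizes.
PAGE_SIZES = {"8K": 8 * KB, "64K": 64 * KB, "4M": 4 * MB, "256M": 256 * MB}


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every entry maps page_size."""
    return entries * page_size


def human(nbytes: int) -> str:
    """Render a byte count in the largest binary unit that divides it."""
    for unit, size in (("GB", GB), ("MB", MB), ("KB", KB)):
        if nbytes >= size and nbytes % size == 0:
            return f"{nbytes // size}{unit}"
    return f"{nbytes}B"


if __name__ == "__main__":
    for name, size in PAGE_SIZES.items():
        print(f"{name:>5} pages: reach {human(tlb_reach(TLB_ENTRIES, size))}")
```

A TLB full of 8KB entries therefore covers only 512KB of memory, while the same 64 entries on 256MB pages cover 16GB.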

We knew that TLB performance was going to be critical for UltraSPARC T1, so early on
we started a number of Solaris projects to provide optimal TLB performance
for the processor.

The first project was MPSS for Vnodes (VMPSS), also known as large
pages for text and libraries. Before this project, binary text and library
segments were always placed on 8KB pages. For large binaries such as Oracle
or SAP, or applications with a large number of libraries, this resulted in a
high number of ITLB misses per second.

VMPSS provides in-kernel infrastructure and mechanisms so that large pages
can be used with file mappings that are the text and initdata segments of binaries
and libraries. Text and libraries are placed on the largest page size possible.
For smaller binaries and libraries this is usually 64KB pages, but for bigger
binaries such as Oracle it is 4MB pages.

Use pmap -xs to see the page size that has been allocated. The first entries in
the output are the binary itself; libraries are usually towards the end of the listing.

14148: /usr/sap/SSO/SYS/exe/run/saposcol
Address Kbytes RSS Anon Locked Pgsz Mode Mapped File
0000000100000000 320 320 - - 64K r-x-- saposcol
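For long listings it can help to total the mapped kilobytes per page size. The sketch below parses pmap -xs output as plain whitespace-separated text, assuming the column order shown above (Address, Kbytes, RSS, Anon, Locked, Pgsz, Mode, Mapped File); the function name `kbytes_by_pgsz` is hypothetical and the sample is the listing from this article.

```python
from collections import defaultdict

# The pmap -xs listing shown above.
SAMPLE = """\
14148: /usr/sap/SSO/SYS/exe/run/saposcol
Address Kbytes RSS Anon Locked Pgsz Mode Mapped File
0000000100000000 320 320 - - 64K r-x-- saposcol
"""


def kbytes_by_pgsz(pmap_text: str) -> dict:
    """Sum the Kbytes column of pmap -xs output, grouped by the Pgsz column."""
    totals = defaultdict(int)
    for line in pmap_text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # the "pid: command" line or blank lines
        try:
            int(fields[0], 16)  # mapping rows start with a hex address
        except ValueError:
            continue  # the column header or a trailing "total" line
        pgsz = fields[5]
        if pgsz != "-":
            totals[pgsz] += int(fields[1])
    return dict(totals)


if __name__ == "__main__":
    print(kbytes_by_pgsz(SAMPLE))
```

On a Solaris host the same function could be fed the text of `pmap -xs <pid>`, for example captured with `subprocess.run(["pmap", "-xs", str(pid)], capture_output=True, text=True).stdout`.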

The performance gains on UltraSPARC T1 were significant: up to 10% on some Oracle
workloads.

The second TLB-related project was Large Pages for Kernel Memory, which provides
large pages for the kernel heap. The kernel is a particularly bad TLB miss
offender: code generally spends less time in the kernel, so
on entry the TLB is usually cold. Prior to this project the kernel heap
had been mapped on 8KB pages. We saw moderate performance gains with this project.

The third project was Large Page OOB (out-of-the-box) Performance (MPOOB).
The Multiple Page Size Support (MPSS) project in Solaris 9 added support
for page sizes other than 8KB, but MPSS environment variables needed
to be set and the library mpss.so.1 preloaded prior to running an application.
The aim of the MPOOB project was to bring the benefits of large pages to a
broader range of applications out-of-the-box, without requiring
the MPSS variables.

MPOOB affects the allocation of heap, stack and anon pages.

Again, check whether large pages were obtained using pmap -xs:

0000000104858000 32 32 32 - 8K rwx-- [ heap ]
0000000104860000 3712 3712 3712 - 64K rwx-- [ heap ]
0000000104C00000 8192 8192 8192 - 4M rwx-- [ heap ]

This is a huge win for our customers, freeing them from the need to set environment
variables to tune the TLBs.

If for some reason pages are not allocated as expected by default, the MPSS variables
can still be used to override the defaults. See the manual entry for mpss.so.1 for details.
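As an illustration of such an override, the sketch below builds the environment for launching a program with mpss.so.1 preloaded. MPSSHEAP and MPSSSTACK are the preferred heap and stack page-size variables described in the mpss.so.1 manual entry; the helper names, the default sizes chosen here, and the idea of wrapping the launch in Python are assumptions for the example, and actually running it only makes sense on Solaris.

```python
import os
import subprocess


def mpss_env(heap: str = "4M", stack: str = "64K", base=None) -> dict:
    """Return a copy of the environment with mpss.so.1 preloaded.

    heap and stack are preferred page sizes such as "8K", "64K" or "4M".
    """
    env = dict(os.environ if base is None else base)
    preload = env.get("LD_PRELOAD", "")
    # Append to, rather than overwrite, any libraries already preloaded.
    env["LD_PRELOAD"] = f"{preload}:mpss.so.1" if preload else "mpss.so.1"
    env["MPSSHEAP"] = heap
    env["MPSSSTACK"] = stack
    return env


def run_with_mpss(argv, heap: str = "4M", stack: str = "64K"):
    """Launch argv with the MPSS overrides applied (Solaris only)."""
    return subprocess.run(argv, env=mpss_env(heap, stack), check=False)


if __name__ == "__main__":
    # Show the variables that would be set, starting from an empty environment.
    print(mpss_env(base={}))
```

Checking the result with pmap -xs afterwards, as above, confirms whether the requested sizes were actually granted.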

The fourth large page project was support for 256MB pages, aka giant pages, on
UltraSPARC T1 systems. This project was actually added as part of the
UltraSPARC IV+ project; however, the TLB programming is different on Niagara.

For an allocation to be a candidate for 256MB pages it must have the following
characteristics:

- Be at least 256MB in size
- Be aligned on a 256MB address boundary

Giant pages can only be allocated on a 256MB address boundary. If an allocation
is larger than 256MB but does not start on such a boundary, Solaris will use 8KB,
64KB and 4MB pages until the next boundary is reached, and will attempt to use
256MB pages from there.
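The boundary rule can be modelled as a greedy split: at each address, use the largest supported page size that the address is aligned to and that still fits before the end of the allocation. This is a simplified sketch, not the Solaris allocator (which also depends on large page availability and per-segment policy), and `split_mapping` is a hypothetical name; it does, however, reproduce the heap and SGA layouts in the pmap examples in this article.

```python
KB = 1024
MB = 1024 * KB

# UltraSPARC T1 page sizes, largest first, with pmap-style labels.
PAGE_SIZES = [256 * MB, 4 * MB, 64 * KB, 8 * KB]
NAMES = {256 * MB: "256M", 4 * MB: "4M", 64 * KB: "64K", 8 * KB: "8K"}


def split_mapping(start: int, length: int) -> list:
    """Greedily cover [start, start + length) with aligned pages.

    Returns (address, kbytes, pgsz) runs, one per contiguous page size,
    in the same spirit as the rows of pmap -xs.
    """
    if start % (8 * KB) or length % (8 * KB):
        raise ValueError("start and length must be multiples of 8KB")
    runs = []
    addr, end = start, start + length
    while addr < end:
        # 8KB always qualifies, so next() cannot run out of candidates.
        size = next(p for p in PAGE_SIZES if addr % p == 0 and addr + p <= end)
        if runs and runs[-1][2] == NAMES[size]:
            first, kb, name = runs[-1]
            runs[-1] = (first, kb + size // KB, name)  # extend current run
        else:
            runs.append((addr, size // KB, NAMES[size]))
        addr += size
    return runs


if __name__ == "__main__":
    # The ISM segment from the SGA example: start address and total size.
    for addr, kb, pgsz in split_mapping(0x380000000, (25427968 + 16384 + 56) * KB):
        print(f"{addr:016X} {kb:>9} {pgsz}")
```

For the SGA segment it reports 97 256MB pages, then 16MB on 4MB pages and 56KB on 8KB pages, the same split pmap shows.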

One of the biggest performance gains from giant pages is in the Oracle SGA,
which is allocated as System V shared memory. If the SGA is large, it should
be allocated on giant pages. Again, use pmap -xs to confirm:

0000000380000000 25427968 25427968 - 25427968 256M rwxsR [ ism shmid=0x3 ]
0000000990000000 16384 16384 - 16384 4M rwxsR [ ism shmid=0x3 ]
0000000991000000 56 56 - 56 8K rwxsR [ ism shmid=0x3 ]

In this example the first 24.25GB of the SGA (97 pages of 256MB) is allocated on
256MB pages. The 16MB tail that follows is allocated on 4MB pages, and the 56KB
residue is allocated on 8KB pages.

The final project added to Solaris was Large Page Availability. The aim of this
project was to increase the number of large pages in the system and to improve the
efficiency of creating large pages. This project is largely hidden from the end user;
it is key, however, to ensuring applications can allocate large pages.

To determine how well the TLBs are doing, use the trapstat command. The trapstat -T
option breaks down the data as follows:

- Per hardware strand
- User and kernel
- Page size within each mode

On an 8-core, 32-thread UltraSPARC T1 system the output is very long. The example
below gives the last strand's output plus the total:

cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
 31 u   8k|       989  0.1         5  0.0 |     28050  1.2         3  0.0 | 1.3
 31 u  64k|      2510  0.2         0  0.0 |    139354  5.4         4  0.0 | 5.6
 31 u   4m|      2768  0.2         0  0.0 |     94936  4.5         0  0.0 | 4.7
 31 u 256m|         0  0.0         0  0.0 |     79590  3.6         0  0.0 | 3.6
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
 31 k   8k|      1921  0.1         0  0.0 |     35701  1.3         6  0.0 | 1.4
 31 k  64k|         0  0.0         0  0.0 |       330  0.0         0  0.0 | 0.0
 31 k   4m|         0  0.0         0  0.0 |        71  0.0         0  0.0 | 0.0
 31 k 256m|         0  0.0         0  0.0 |      3388  0.2         4  0.0 | 0.2
==========+===============================+===============================+====
      ttl |    278212  0.6        68  0.0 |  12334583 16.4       368  0.0 |16.9

Note the difference from traditional SPARC processors: 512KB pages have been dropped
and new 256MB entries added.

In this example there are very few ITLB misses compared with DTLB misses (278212
versus 12334583 in total, or 0.6% versus 16.4% of time); this is because of large
pages for text and libraries. There are also 256MB page misses in the kernel,
indicating that large pages for the kernel heap are also in operation.
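When the per-strand listing is long, it can be easier to post-process it. The sketch below parses trapstat -T data rows by splitting on the `|` separators, assuming the column layout shown above; the `TrapstatRow` class and `parse_trapstat_tlb` function are hypothetical names, and the sample is the strand 31 portion of the output in this article. It picks out the mode and page size that account for the most DTLB miss time.

```python
from dataclasses import dataclass

# Strand 31 rows from the trapstat -T example in this article. Spacing does
# not matter: the parser only relies on the '|' column-group separators.
SAMPLE = """\
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
31 u 8k| 989 0.1 5 0.0 | 28050 1.2 3 0.0 | 1.3
31 u 64k| 2510 0.2 0 0.0 | 139354 5.4 4 0.0 | 5.6
31 u 4m| 2768 0.2 0 0.0 | 94936 4.5 0 0.0 | 4.7
31 u 256m| 0 0.0 0 0.0 | 79590 3.6 0 0.0 | 3.6
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
31 k 8k| 1921 0.1 0 0.0 | 35701 1.3 6 0.0 | 1.4
31 k 64k| 0 0.0 0 0.0 | 330 0.0 0 0.0 | 0.0
31 k 4m| 0 0.0 0 0.0 | 71 0.0 0 0.0 | 0.0
31 k 256m| 0 0.0 0 0.0 | 3388 0.2 4 0.0 | 0.2
==========+===============================+===============================+====
ttl | 278212 0.6 68 0.0 | 12334583 16.4 368 0.0 |16.9
"""


@dataclass
class TrapstatRow:
    cpu: int          # hardware strand id
    mode: str         # "u" = user, "k" = kernel
    size: str         # page size label, e.g. "8k" or "256m"
    itlb_miss: int
    itlb_pct: float   # %tim spent handling ITLB misses
    dtlb_miss: int
    dtlb_pct: float   # %tim spent handling DTLB misses
    total_pct: float  # final %tim column


def parse_trapstat_tlb(text: str) -> list:
    """Return one TrapstatRow per per-strand data line, skipping the rest."""
    rows = []
    for line in text.splitlines():
        parts = line.split("|")
        if len(parts) != 4:
            continue  # dashed and '=' separator lines contain no '|'
        key = parts[0].split()
        if len(key) != 3 or not key[0].isdigit():
            continue  # the header line or the "ttl" total line
        i_miss, i_pct, _itsb_miss, _itsb_pct = parts[1].split()
        d_miss, d_pct, _dtsb_miss, _dtsb_pct = parts[2].split()
        rows.append(TrapstatRow(int(key[0]), key[1], key[2],
                                int(i_miss), float(i_pct),
                                int(d_miss), float(d_pct),
                                float(parts[3])))
    return rows


if __name__ == "__main__":
    rows = parse_trapstat_tlb(SAMPLE)
    worst = max(rows, key=lambda r: r.dtlb_pct)
    print(f"most DTLB miss time: mode={worst.mode} size={worst.size} "
          f"({worst.dtlb_miss} misses, {worst.dtlb_pct}%)")
```

For this sample the largest share of DTLB miss time is in user mode on 64KB pages, which would suggest looking at which segments are still mapped on 64KB rather than 4MB pages.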

About

denissheahan