Understanding Memory Allocation and File System Caching in OpenSolaris

Yes, I'm guilty. While the workings of the Solaris virtual memory system prior to Solaris 8 are well documented, I've not written much about the new cyclic file system cache. There will be quite a bit on this subject in the new Solaris Internals edition, but to answer a giant FAQ somewhat sooner, I'm posting this overview.

A quick introduction

File system caching has been implemented as an integrated part of the Solaris virtual memory system since as far back as SunOS 4.0. This has the great advantage of dynamically using available memory as a file system cache, which can speed up some I/O-intensive applications by as much as 500x. But there were some historic side effects: applications doing a lot of file system I/O could swamp the memory system with demand for memory allocations, putting so much pressure on memory that pages were aggressively stolen from important applications. Typical symptoms of this condition were that everything seemed to "slow down" whenever file I/O was occurring, and that the system constantly reported it was out of memory. In Solaris 2.6 and 7, as part of the feature named "Priority Paging", I updated the paging algorithms to steal only file system pages unless there was a real memory shortage. This meant that although file I/O still caused significant pressure and high "scan rates", applications were no longer paged out and didn't suffer from the pressure. A healthy Solaris 7 system still reported it was out of memory, but performed well.

Solaris 8 - "Cyclic Page Cache"

Starting with Solaris 8, we made a significant architectural enhancement that provides a more complete solution: the file system cache was changed so that it steals memory from itself, rather than from other parts of the system. Hence, a system with a large amount of file I/O remains in a healthy virtual memory state -- with large amounts of visible free memory -- and since the page scanner doesn't need to run, there are no aggressive scan rates. And because the page scanner is no longer required to constantly free up large amounts of memory, it no longer limits file system I/O throughput. A further benefit is that an application which wants to allocate a large amount of memory can do so by efficiently consuming it directly from the file system cache. For example, starting Oracle with a 50-Gbyte SGA now takes less than a minute, compared to the 20-30 minutes required with the prior implementation.

The old allocation algorithm

To keep this explanation relatively simple, let's take a brief look at what used to happen with Solaris 7, even with priority paging. The file system consumes memory from the freelist every time a new page is read from disk (or wherever) into the file system cache. The more pages we read, the more pages are depleted from the system's free list (the central place where memory is kept for reuse). Eventually (sometimes rather quickly), the free memory pool is exhausted. At this point, if there is enough pressure, further requests for new memory pages block until the free memory pool is replenished by the page scanner. The page scanner scans inefficiently through all of memory looking for pages it can free, and slowly refills the free list -- but only by enough to satisfy the immediate request. Processes resume for a short time, then stop again as they run short of memory. The page scanner is a bottleneck in the whole memory life cycle.

In the old design, the file system's cache mechanism (segmap) consumes memory from the free list until it is depleted. After those pages are used, they are kept around, but they are only immediately accessible to the file system cache in the direct re-use case; that is, if a file system cache hit occurs, they can be "reclaimed" back into segmap to avoid a subsequent physical I/O. However, if the file system needs a new page, there is no easy way of finding these pages -- instead, the page scanner is used to stumble across them. The page scanner effectively rummages blindly through the entire system looking for pages with which to refill the free list, and it has to refill the free list at the same rate as the file system is reading new pages. It is a single point of constraint in the whole design.

The new allocation algorithm

The new algorithm uses a central list to hold inactive file cache pages (those that aren't immediately mapped anywhere), so that they can easily be used to satisfy new memory requests. This is a very subtle change, but one with significant, demonstrable effects. Firstly, the file system cache now behaves as a single age-ordered FIFO: recently read pages are placed on the tail of the list, and new pages are consumed from the head. While on the list, the pages remain valid cached portions of their files, so if a read cache hit occurs, they are simply removed from wherever they are on the list. This means that pages which are accessed often (frequent cache hits) keep getting pulled back to the tail of the list, and only the oldest and least used pages migrate to the head as candidates for freeing.

The cachelist is linked to the freelist such that, if the free list is exhausted, pages are taken from the head of the cachelist and their contents discarded. New pages are requested from the freelist first, but since this list is often empty, allocations occur mostly from the head of the cachelist, consuming the oldest file system cache pages. The page scanner doesn't need to get involved, eliminating the paging bottleneck and the need to run the scanner at high rates (and hence wasting no CPU on scanning, either).

If an application process requests a large amount of memory, it too can take from the cachelist via the freelist. Thus, an application can take a large amount of memory from the file system cache without needing to start the page scanner, resulting in substantially faster allocation.
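To make the mechanism concrete, here is a minimal user-level sketch in C of the cachelist/freelist interplay described above. This is illustrative only -- not the kernel's actual implementation -- and all of the names (page_t, cachelist_insert, and so on) are hypothetical.

/* Illustrative sketch of the cachelist discipline -- not kernel code. */
#include <stddef.h>

typedef struct page {
    struct page *prev, *next;  /* doubly linked so a cache hit can unlink */
    int          valid;        /* still holds valid file data while listed */
} page_t;

static page_t *cache_head, *cache_tail;  /* oldest ... newest */
static page_t *freelist;                 /* truly free pages */

/* A page released by the file system cache goes on the tail: youngest. */
void cachelist_insert(page_t *pp) {
    pp->next = NULL;
    pp->prev = cache_tail;
    if (cache_tail != NULL)
        cache_tail->next = pp;
    else
        cache_head = pp;
    cache_tail = pp;
}

/* A read cache hit reclaims the page from wherever it sits on the list. */
void cachelist_unlink(page_t *pp) {
    if (pp->prev != NULL) pp->prev->next = pp->next; else cache_head = pp->next;
    if (pp->next != NULL) pp->next->prev = pp->prev; else cache_tail = pp->prev;
}

/* Allocation: prefer the freelist; if it is empty, recycle the oldest
 * cached file page from the head and discard its contents. */
page_t *page_alloc(void) {
    if (freelist != NULL) {
        page_t *pp = freelist;
        freelist = pp->next;
        return (pp);
    }
    if (cache_head != NULL) {
        page_t *pp = cache_head;
        cachelist_unlink(pp);
        pp->valid = 0;  /* contents discarded: no longer a cache page */
        return (pp);
    }
    return (NULL);      /* genuine shortage: page scanner territory */
}

The key property is in page_alloc(): whether the consumer is the file system or an application, the oldest cached file page is recycled without any help from the page scanner.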

Putting it all together: The Allocation Cycle of Physical Memory

The most significant central pool of physical memory is the freelist. Physical memory is placed on the freelist in page-size chunks when the system is first booted and is then consumed as required. Three major types of allocation occur from the freelist.

Anonymous/process allocations

Anonymous memory, the most common form of allocation from the freelist, is used for most of a process's memory, including heap and stack. Anonymous memory also backs shared memory mappings. A small amount of anonymous memory is used in the kernel as well, for items such as thread stacks. Anonymous memory is pageable and is returned to the freelist when it is unmapped or when it is stolen by the page scanner daemon.
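For illustration, here is a minimal C example of anonymous allocation: mapping with MAP_ANON creates pages with no backing file, which are taken from the freelist as they are touched and handed back when unmapped.

/* Anonymous memory in a nutshell: pages with no backing file. */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    size_t len = 8 * 1024 * 1024;
    /* MAP_ANON: no file behind this mapping -- pure anonymous memory. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return (1);
    }
    memset(p, 0, len);  /* touching the pages allocates them from the freelist */
    munmap(p, len);     /* the pages go straight back to the freelist */
    return (0);
}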

File system “page cache”

The page cache is used for caching of file data for file systems. The file system page cache grows on demand to consume available physical memory as a file cache and caches file data in page-size chunks. Pages are consumed from the freelist as files are read into memory. The pages then reside in one of three places: the segmap cache, the address space of a process they are mapped into, or the cachelist.

The cachelist is the heart of the page cache. All unmapped file pages reside on the cachelist. Mapped files and the segmap cache work in conjunction with the cachelist.

Think of the segmap file cache as the fast, first-level file system read/write cache. segmap holds file data read and written through the read and write system calls. Memory is allocated from the freelist to satisfy a read of a new file page, which then resides in the segmap file cache. File pages are eventually moved from segmap to the cachelist to make room for more pages in the segmap cache.
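As a sketch of these two paths from an application's point of view -- read() through segmap versus a memory-mapped file -- consider the following C fragment (the file path is hypothetical):

/* Two ways file data enters the page cache. */
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    char buf[8192];
    int fd = open("/var/tmp/datafile", O_RDONLY);  /* hypothetical path */
    if (fd < 0)
        return (1);

    /* Path 1: read() copies data out through the segmap cache; the
     * pages later migrate from segmap to the cachelist. */
    (void) read(fd, buf, sizeof (buf));

    /* Path 2: mmap() places the file pages in our address space; they
     * stay mapped (not on the cachelist) until we unmap them. */
    char *p = mmap(NULL, 8192, PROT_READ, MAP_SHARED, fd, 0);
    if (p != MAP_FAILED) {
        volatile char c = p[0];  /* touch: faults the page in */
        (void) c;
        munmap(p, 8192);         /* the page now returns to the cachelist */
    }
    close(fd);
    return (0);
}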

The segmap cache is typically sized at 12% of physical memory on SPARC systems and works in conjunction with the cachelist to cache file data. When files are accessed through the read and write system calls, up to 12% of physical memory's worth of file data resides in the segmap cache; the remainder is on the cachelist.

Memory mapped files also allocate memory from the freelist, and those pages remain allocated for the duration of the mapping, or until a global memory shortage occurs. When a file is unmapped (explicitly or via madvise), its pages are returned to the cachelist.

The cachelist operates as part of the freelist. When the freelist is depleted, allocations are made from the oldest pages in the cachelist. This allows the file system page cache to grow to consume all available memory and to dynamically shrink as memory is required for other purposes.

Kernel allocations

The kernel uses memory to manage information about internal system state; for example, the list of processes in the system. The kernel allocates memory from the freelist for these purposes with its own allocators: the slab and vmem allocators. However, unlike process and file allocations, the kernel seldom returns memory to the freelist; instead, memory circulates between kernel subsystems and the kernel allocators. Memory is consumed from the freelist only when the total kernel allocation grows.
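The object-caching behavior of the slab allocator can be seen from userland via libumem (shipped since Solaris 9), which is a port of the kernel's allocator. A minimal sketch -- the cache name and structure are made up for the example; compile with cc slab.c -lumem:

/* Slab-style object caching with libumem. */
#include <umem.h>

typedef struct conn {
    int  id;
    char state;
} conn_t;

int main(void) {
    /* One cache per object type; freed objects stay warm in the cache
     * rather than returning to the general memory pool. */
    umem_cache_t *cp = umem_cache_create("conn_cache", sizeof (conn_t),
        0, NULL, NULL, NULL, NULL, NULL, 0);

    conn_t *c = umem_cache_alloc(cp, UMEM_DEFAULT);
    if (c != NULL) {
        c->id = 1;
        umem_cache_free(cp, c);  /* back to the cache, not the system */
    }
    umem_cache_destroy(cp);
    return (0);
}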

Memory allocated to the kernel is mostly nonpageable and so cannot be managed by the system page scanner daemon. Memory is returned to the system freelist proactively by the kernel's allocators when a global memory shortage occurs.

How to observe and monitor the new VM algorithms

The page scanner and its metrics are an important indicator of memory health. If the page scanner is running, there is likely a memory shortage. This is an interesting departure from the behavior you may have been accustomed to on Solaris 7 and earlier, where the page scanner was always running. Since Solaris 8, the file system cache resides on the cachelist, which is counted as part of global free memory. Thus, if a significant amount of memory is available -- even if it's being used as a file system cache -- the page scanner won't be running.

The most important metric is the scan rate, which indicates whether the page scanner is running. The scanner starts scanning at an initial rate (slowscan) when free memory falls to the configured watermark, lotsfree, and then scans faster as free memory gets lower, up to a maximum rate (fastscan).
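In rough terms, the target rate interpolates linearly between slowscan and fastscan as free memory drops from lotsfree toward zero. A simplified sketch (the kernel's actual computation includes additional terms and clamping):

/* Simplified model of the scanner's target rate, in pages/second. */
#include <stdint.h>

uint64_t scan_rate(uint64_t freemem, uint64_t lotsfree,
    uint64_t slowscan, uint64_t fastscan) {
    if (freemem >= lotsfree)
        return (0);  /* plenty of memory: no scanning at all */
    /* Linear ramp from slowscan (at lotsfree) to fastscan (at zero). */
    return (slowscan +
        (fastscan - slowscan) * (lotsfree - freemem) / lotsfree);
}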

We can perform a quick and simple health check by determining whether there is a significant memory shortage. To do this, use vmstat to look at scanning activity and check to see if there is sufficient free memory on the system.
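If you'd rather perform this check programmatically than eyeball vmstat, freemem and lotsfree are exported through the unix:0:system_pages kstat. A minimal libkstat sketch, with error handling trimmed; compile with cc memcheck.c -lkstat:

/* Read freemem and lotsfree from the unix:0:system_pages kstat. */
#include <kstat.h>
#include <stdio.h>

int main(void) {
    kstat_ctl_t *kc = kstat_open();
    if (kc == NULL)
        return (1);
    kstat_t *ksp = kstat_lookup(kc, "unix", 0, "system_pages");
    if (ksp == NULL || kstat_read(kc, ksp, NULL) == -1)
        return (1);

    kstat_named_t *freemem = kstat_data_lookup(ksp, "freemem");
    kstat_named_t *lotsfree = kstat_data_lookup(ksp, "lotsfree");

    printf("freemem  = %lu pages\n", freemem->value.ul);
    printf("lotsfree = %lu pages\n", lotsfree->value.ul);
    if (freemem->value.ul < lotsfree->value.ul)
        printf("below lotsfree: the page scanner will be running\n");

    (void) kstat_close(kc);
    return (0);
}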

Let’s first look at a healthy system. This system is showing 970 Mbytes of free memory in the free column, and a scan rate (sr) of zero.

sol8# vmstat -p 3
     memory           page          executable      anonymous      filesystem 
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
 1512488 837792 160 20 12   0   0    0    0    0    0    0    0   12   12   12
 1715812 985116 7  82   0   0   0    0    0    0    0    0    0   45    0    0
 1715784 983984 0   2   0   0   0    0    0    0    0    0    0   53    0    0
 1715780 987644 0   0   0   0   0    0    0    0    0    0    0   33    0    0

Looking at a second case, we can see two of the key indicators showing a memory shortage—both high scan rates (sr > 50000 in this case) and very low free memory (free < 10 Mbytes).

sol8# vmstat -p 3
     memory           page          executable      anonymous      filesystem 
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
 2276000 1589424 2128 19969 1 0 0    0    0    0    0    0    0    0    1    1
 1087652 388768 12 129675 13879 0 85590 0 0   12    0 3238 3238   10 9391 10630
 608036 51464  20 8853 37303 0 65871 38   0  781   12 19934 19930 95 16548 16591
  94448  8000  17 23674 30169 0 238522 16 0  810   23 28739 28804 56  547  556

Given that the page scanner runs only when the freelist and cachelist are effectively depleted, any scanning activity is our first sign of memory shortage. Drilling down further with ::memstat shows us where the major allocations are.

sol9# mdb -k
Loading modules: [ unix krtld genunix ip ufs_log logindmux ptm cpc sppp ipc random nfs ]
> ::memstat

Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                      53444               208   10%
Anon                       119088               465   23%
Exec and libs                2299                 8    0%
Page cache                  29185               114    6%
Free (cachelist)              347                 1    0%
Free (freelist)            317909              1241   61%

Total                      522272              2040
Physical                   512136              2000

The categories are described as follows:

Kernel

The total memory used for nonpageable kernel allocations. This is how much memory the kernel is using, excluding anonymous memory used for ancillaries (see Anon).

Anon

The amount of anonymous memory. This includes user process heap, stack and copy-on-write pages, shared memory mappings, and small kernel ancillaries, such as lwp thread stacks, present on behalf of user processes.

Exec and libs

The amount of memory used for mapped files interpreted as binaries or libraries. This is typically the sum of memory used for user binaries and shared libraries. Technically, this memory is part of the page cache, but it is page cache tagged as “executable” when a file is mapped with PROT_EXEC and file permissions include execute permission.

Page cache

The amount of unmapped page cache, that is, page cache not on the cachelist. This category includes the segmap portion of the page cache, and any memory mapped files. If the applications on the system are solely using a read/write path, then we would expect the size of this bucket not to exceed segmap_percent (defaults to 12% of physical memory size). Files in /tmp are also included in this category.

Free (cachelist)

The amount of page cache on the cachelist. The cachelist contains unmapped file pages and is typically where the majority of the file system cache resides. Expect to see a large cachelist on a system that has large file sets and sufficient memory for file caching.

Free (freelist)

The amount of memory that is actually free. This is memory that has no association with any file or process.

If you want this functionality on Solaris 8, copy the downloadable memory.so into /usr/lib/mdb/kvm/sparcv9, and then use ::load memory before running ::memstat. (Note that this is not Sun-supported code, but it is considered low risk since it affects only the mdb user-level program.)


# wget http://www.solarisinternals.com/si/downloads/memory.so
# cp memory.so /usr/lib/mdb/kvm/sparcv9
# mdb -k
Loading modules: [ unix krtld genunix ip ufs_log logindmux ptm cpc sppp ipc random nfs ]
> ::load memory
> ::memstat

That's it for now.

Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Technorati Tag: mdb

Comments:

Thanks for the explanation of how this works. It seems as though on a machine doing lots of buffered IO that locks acquired while moving pages to the end of the file system cache list would get extremely hot. As the number of spindles or speed of array-based cache increases, this problem would seem to get worse. Is there any merit to this concern, or are the locks and pointer munging efficient enough to not make this much of a concern? Is there a particular lock that I should look for in lockstat output to observe this?

One other question that has bugged me for quite a while is why vmstat's notion of how much swap space is available doesn't match the value reported by swap -l. Document 1181907 says that this difference is not a bug. I have experienced a similar situation (see document 5052904) where vmstat reported more free swap than I had on the machine. Can you elaborate on how (if?) I should interpret the swap column from vmstat?

Posted by Mike Gerdts on May 13, 2005 at 07:17 AM PDT #

The cachelist is actually implemented as a set of lists indexed by a hash, so as to avoid contention issues. Check for the pse_mutexes in lockstat output. In regard to swap space: there is little relationship between the swap space reported by vmstat and that reported by swap -l. This is because swap is virtualized in Solaris and is only allocated from real disk-based swap at page-out time. The total swap available to reserve against is close to RAM + disk space. The swap -s command and vmstat report reservations against virtual swap, while swap -l reports physical use. When swap -s and vmstat report swap near zero, memory allocations will start returning ENOMEM or EAGAIN, often resulting in application failure (how many apps are really coded to check the return value of malloc()?). When swap -l reports it's out of physical swap, it just means there's no room for further page-outs to occur, resulting in some performance anomalies. I'll post more on this subject if there is further interest...

Posted by Richard McDougall on May 13, 2005 at 10:20 AM PDT #

Nice article! Adding to the previous swap topic, how would one go about generally translating the vmstat(5) data to say actual processes effecting the stat changes. In particular, as swap or real memory is consumed, how could one identify the target processes responsible for the consumption. I've started to look at the vminfo provider in dtrace and I'm beginning to put something together but was wondering if you guys already have something that can be shared from your toolbox?

Posted by Marc Rocas on May 13, 2005 at 09:30 PM PDT #

I am an application programmer and I don't know much about the kernel. Our application, which runs on Solaris 10, needs to monitor the use of virtual memory. Once a second the program calculates the available virtual memory using:

struct anoninfo ai;
if (swapctl(SC_AINFO, &ai) < 0) {
    /* error */
}
size_t avail = ai.ani_max - ai.ani_resv;
size_t availKb = avail * m_pageSize;

If availKb reaches a LIMIT, we set an alarm. But when all the virtual memory was exhausted, we didn't see any alarm. So is this calculation wrong?

Posted by Hua Jiang on May 15, 2006 at 06:40 PM PDT #

There are no predetermined alarms on virtual memory exhaustion. Applications will, however, begin to receive ENOMEM from system calls attempting to allocate memory. Can you explain more about why you want to measure and potentially consume all virtual memory? I'm having trouble imagining why this would be useful. Thanks, Richard.

Posted by Richard on May 18, 2006 at 03:43 AM PDT #

Sorry I haven't made this problem clear. Our application (running on Solaris 10) is designed to process calls. Every call costs about 80 KB of virtual memory, so when virtual memory is running short (for example, 70% is occupied), it needs to refuse new calls so that the occupied virtual memory will not grow. We monitor the virtual memory like this:

if (swapctl(SC_AINFO, &ai) < 0) {
    /* error */
}
size_t avail = ai.ani_max - ai.ani_resv;
size_t availKb = avail * m_pageSize;

If availKb is less than 20% (= 1 - 80%) of total virtual memory, new calls are refused. Our machine ran and ran, and in the end it hung. We checked /var/adm and found: WARNING: Sorry, no swap space to grow stack for pid... etc. I am confused why our calculation didn't work.

Posted by Hua Jiang on May 18, 2006 at 08:55 PM PDT #

This was very useful information for me (and thanks for your earlier stuff as well). One question, I see some systems where ::memstat reports large Anon allocations (75% of memory) and vmstat shows very low free memory. Is this Anon allocation accounted for by NFS (diskless clients are being used)? pmap -x of all processes does not account for the Anon memory. I see other similar systems with much lower Anon allocations. The description of "Anon" memory above doesn't seem to capture this(?).

Posted by Randy Smith on May 30, 2006 at 04:20 PM PDT #
