Page Coalescing in Solaris
By mec on Jun 13, 2005
Welcome to OpenSolaris! In this entry, I'll walk through the basics of page coalescing within Solaris. I tend to focus on SPARC since that is my main background, but the logic is actually in common code so it works on both SPARC and x86 architectures. I'll use SPARC in this post as there are four page sizes (8K, 64K, 512K and 4M) which are supported on most SPARC chips.
Before getting into the details of page coalescing, some basic concepts need to be introduced. In Solaris, a page_t is the structure which tracks information for a page of memory, and the base size of memory which is tracked is 8K on SPARC (4K on x86). Now for efficiency reasons we may prefer to use larger page sizes as they may use less hardware resources (I don't have enough time to go into the intricacies of the hardware and their support for large pages, but different processors handle large pages differently.) So, a 64K page is actually a collection of 8K pages for which the first 8K page is aligned on a 64K physical boundary. Inside the page_t this is denoted by the p_szc field, which stands for "page size code." For SPARC, p_szc of 0 corresponds to an 8K page, p_szc of 1 to a 64K page, 2 to 512K and 3 to 4M. The kernel maintains a freelist to hold pages of each size code and pages are demoted or promoted depending upon the needs of the system and thus can move between the different sized freelists.
So, page coalescing is a process used to create large pages from smaller pages. Over time memory in the system gets fragmented and often most free memory ends up on the smallest sized freelist. Now if a request for a 4M page comes in and there are no 4M pages available, the kernel will have to go out and create a page of this size to use. The first step in this process is called page coalescing which will try to coalesce adjacent free pages to create a large free page of the desired size. This is the function of page_freelist_coalesce. If this fails, the kernel can get more aggressive and relocate allocated pages out of a range to create a large free page. I'll get into this in another blog in the future.
Now we get to the good stuff which OpenSolaris has made that much easier to share; how the code works.
Let's look at page_freelist_coalesce via the source browser and you'll find that the function is quite macro heavy. There are a few things to note here about the macros:
PAGE_COUNTERS_\* macros manipulate the page_counters data structure which contains all of the info we need to coalesce free pages into larger free pages. Check out vm_pagelist.c. for the definition of this structure as well as comments on the basics of what the stored data represents. The comment in the code here gives a good description of the data structure which you'll need to read through to understand the rest of the blog.
PGCTRS_CANDS_\*. macros manipulate the page_ctrs_cands data structure which is an optimization on top of the page_counters which essentially keeps track of popcounts for given regions of memory. We'll see how this optimization works shortly.
So, after reading the code and comments above, lets look at the function page_freecnt to get a better understanding of what the data looks like and how it is used. page_freecnt returns the number of free base pagesize (8K for SPARC) pages which exist at page pp for the given szc.
So, let's say that we want to see how many free pages there are within a specific 4M range of memory. We call page_freecnt, and take a look in the page_counters data structures to figure out how many free pages there are. For this example, we'll give only the data which is relevant to that single 4M range of memory, and I'll use some very bad ascii art to help explain:
Region ___ 3 | 6 | --- _______________|_______________ 2 | 0 | 0 | 0 | 0 | 8 | 8 | 5 | 0 | ------------------------------- _|____________________________ 1 ... | 0 | 0 | 0 | 8 | 8 | 7 | 0 | 0| ------------------------------ 0 Assume zeros fill in the rest of the array at region 1.In the above picture, we have Regions 3, 2, 1 and 0 which correspond to the array indexes in the page_counters array:
255 static hw_page_map_t \*page_counters[MMU_PAGE_SIZES];
Now the confusing part :) The above represents 4 512K pages being free, 19 64K pages being free, and 23 8K pages being free for a total of (4\*64)+(19\*8)+23 = 431 8K pages which will be free.
Now lets see how we will calculate this information:
So, in the art above, assume we wanted to create a 4M page which corresponded to the page_counters bin which currently has a 6 in it. Here's how the logic will go:
1. We know there are 6 512K pages which either already exist or can be created from smaller pages. Thus, cnt starts at (6\*64) as it takes 64 8K pages to make a 512K page. So, cnt is currently 384.
2. Now, going to the next level, we see 2 8's. We skip over these as these were already counted in the step above. We don't care what their current size is, all we care is that they can be used to create a 512K page. In addition, we see a 5 at this level, so since it takes 8 8K pages to make a 64K page, we need to add (5\*8), to cnt so cnt is now 424.
3. Now we need to go to the next level, and we see 2 8's again which will have contributed to the 64K page count we calculated above. So, we then look at the 7 which represents 7 8K pages which are free, so we need to add this to cnt and get a grand total of 431 which matches what is actually out there in terms of the real size of the pages.