News, tips, partners, and perspectives for the Oracle Solaris operating system

Oracle Solaris Kernel Zones and Large Pages

One of the things that makes Oracle Solaris unique is how easily and transparently it can deal with different memory page sizes. The feature is officially called Multiple Page Size Support (MPSS) and was introduced eons ago in Oracle Solaris 9. That is to say, it supports using memory pages larger than the standard page size (4KB on x86 and 8KB on SPARC). Anything larger than this smallest page size is seen as a "Large Page".

The reason these large pages are important is performance. As systems gained memory over time, using only the smallest pages to allocate application memory created a huge bottleneck on the memory reference systems in the CPUs (like the TLB), and having a way to reference larger chunks of memory in one go became incredibly important. This resulted in the different CPU architectures supporting different larger page sizes: Intel now supports 4KB, 2MB, and 1GB pages, and SPARC supports 8KB, 64KB, 512KB, 4MB, 32MB, 256MB, 2GB, and 16GB pages. This means that if you have a very large application using many GBs or even TBs of memory, you can give it 16GB pages and use relatively few TLB entries, in contrast to the many, many entries needed in the smaller page case, which results in TLB thrashing when your application runs through its memory. So using large pages can give a huge performance boost to large memory applications. Other OSes allow the use of them too, but they are often clunky to use and configure, or require a reboot of the OS before you can start using them. The ability to dynamically allocate different page sizes with no admin steps necessary makes Oracle Solaris unique.
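To put rough numbers on this, here is a small back-of-the-envelope sketch in Python. It only does the division; the 1TB working set is a made-up example, and it says nothing about how many entries any particular CPU's TLB actually holds.

```python
# Page sizes are in bytes; the labels match the SPARC sizes listed above.
KB, MB, GB, TB = 1024, 1024**2, 1024**3, 1024**4

def pages_needed(mem_bytes, page_bytes):
    """Number of pages (and so address translations) needed to map mem_bytes."""
    return -(-mem_bytes // page_bytes)  # ceiling division

working_set = 1 * TB  # hypothetical application footprint, for illustration only

for label, size in [("8KB", 8 * KB), ("4MB", 4 * MB), ("2GB", 2 * GB), ("16GB", 16 * GB)]:
    print(f"{label:>5} pages: {pages_needed(working_set, size):>12,}")
```

With 8KB pages the 1TB working set needs over 134 million translations, with 16GB pages just 64.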

However, to quote the late, great Johan Cruijff: "Every advantage has its disadvantage" (well, he actually said "Every disadvantage has its advantage", but that's essentially the same), and in the case of large pages this is no different. One of the main problems is that in order to use a page that large, you (the kernel) need to free up that amount of memory contiguously, i.e. as one large uninterrupted block of memory. So when the application is started and requests this memory, the kernel needs to find it and will probably have to free it up first. That is to say, it moves other, smaller pages somewhere else so it can create the large page.

It's sort of analogous to trying to park a huge truck and needing to move some parked cars to other spots to create one large parking spot for the truck. The scales are just different than with cars and trucks: if the truck/large page is 2GB, there could be quite a lot of 64KB or 8KB cars/pages that need to be cleared. Now, this all just takes some time, and if all the cars/pages can be moved it'll just make starting the application a little slower. However, sometimes certain cars/pages can't be moved, and in that case that area can't be used for a truck/large page, even if the total memory shows that enough space is available. And the larger the page, the higher the chance this happens. This is one of the reasons why we introduced Virtual Memory 2 (a.k.a. VM 2) in Oracle Solaris 11.1 to help deal with this, maybe something for a future blog.
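The parking lot can be sketched as a toy model in Python. This is only an illustration of the idea, not how the Solaris kernel tracks memory: the frame map, the marker characters, and the function names are all made up, and a "large page" here is just 4 small frames (for a real 2GB page made of 8KB pages the ratio would be 262,144).

```python
FREE, MOVABLE, PINNED = ".", "m", "X"  # empty spot, movable car, car that can't move

def can_form_large_page(frames, start, frames_per_large_page):
    """A large page needs a full, contiguous run that contains no pinned frame."""
    run = frames[start:start + frames_per_large_page]
    return len(run) == frames_per_large_page and PINNED not in run

def large_page_slots(frames, frames_per_large_page):
    """Return the aligned start indexes where a large page could be created."""
    return [
        start
        for start in range(0, len(frames), frames_per_large_page)
        if can_form_large_page(frames, start, frames_per_large_page)
    ]

# Two of the four aligned slots contain a pinned frame.
light = list("..m." + "X..." + "mm.." + ".X.m")
# Twelve of sixteen frames are free, but every aligned slot has a pinned frame.
fragmented = list(".X.." + "..X." + "X..." + "...X")

print(large_page_slots(light, 4))       # slots starting at 0 and 8 are usable
print(large_page_slots(fragmented, 4))  # no usable slot at all
print(fragmented.count(FREE))           # free frames, none of them usable as a large page
```

In the second lot three quarters of the spots are empty, yet there is nowhere to park the truck, which is exactly the case where total free memory looks fine but the large page can't be created.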

When a system/VM boots, hardly any of its memory is being used, so everything is still nice, and pristine, and empty. So allocating a bunch of large pages for an application (say an Oracle Database) is no problem. It's like having a nearly empty parking lot. However, if the system has been running for a (long) while, you can expect memory to be used all over the place. Not a problem: the applications that are already running have their allocation, and all is good. Now if you stop the application for some reason (for example to patch it), this contiguous memory is freed up, and while the application isn't there anything can use it. When the application is restarted the kernel will try to move the "squatters" out again, but sometimes it can't. Some pages aren't relocatable, for example kernel pages or certain ZFS pages. In that case it could be that the kernel can't find the large page size it's looking for and needs to fall back to giving out smaller pages instead, impacting performance.
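The fallback to smaller pages can be pictured with another toy sketch. It assumes, purely for illustration, that all we know is the size of the largest contiguous run the kernel could clear; it ignores alignment and everything else the real allocator takes into account, and the function name is made up.

```python
KB, MB, GB = 1024, 1024**2, 1024**3

# The SPARC page sizes listed earlier, in bytes.
SPARC_PAGE_SIZES = [8 * KB, 64 * KB, 512 * KB, 4 * MB, 32 * MB, 256 * MB, 2 * GB, 16 * GB]

def best_page_size(preferred, largest_clearable_run, sizes=SPARC_PAGE_SIZES):
    """Largest supported page size that is no bigger than what was asked for
    and still fits in the largest contiguous run that can be cleared."""
    for size in sorted(sizes, reverse=True):
        if size <= preferred and size <= largest_clearable_run:
            return size
    return None  # not even the smallest page fits

# Right after boot: plenty of contiguous memory, the 2GB request is honoured.
fresh_boot = best_page_size(preferred=2 * GB, largest_clearable_run=3 * GB)
# After a long uptime: unmovable pages leave at most 1GB contiguous.
after_restart = best_page_size(preferred=2 * GB, largest_clearable_run=1 * GB)

print(fresh_boot // MB, "MB pages after boot")
print(after_restart // MB, "MB pages after restart")
```

In the second case the application still gets its memory, just in 256MB pages instead of 2GB ones, so it needs eight times as many translations for the same amount of memory.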

Over the different releases of Oracle Solaris (Updates and SRUs) we've introduced more and more techniques to combat this: reducing the chance that it happens through grouping of page size allocations (in VM 2), reducing the use of unmovable pages or cleaning them up better, and reducing the area of memory they can use. An example of the last one is limiting the ZFS ARC cache size.

Now to Kernel Zones. When you boot a Kernel Zone, you're essentially booting a very large application (one that happens to be a Type 2 Hypervisor that in turn runs a kernel and applications), and to the hosting kernel it is just one application. When we start the Kernel Zone it wants to use the largest page size possible, so that it in turn can give these large pages to its applications. Going back to the parking lot analogy, think of it as a super large truck, or an event tent, that in turn has cars and trucks in it. Again, this is not a problem at the initial boot of the hosting OS, but after a while you might have problems when rebooting your Kernel Zone: if you've taken it down, run some applications in the host OS or used the filesystem a lot, and then want to boot it again.

To help control this, Oracle Solaris 11.4 SRU1 introduced Memory Reservation Pools (see the Memory Reservation Pools section of the solaris-kz(7) man page). With these you can reserve part of the system memory to be used only by Kernel Zones. Nothing else can use this memory; it's blocked off and reserved, essentially guaranteeing that the Kernel Zones can reboot and always have enough memory, as long as you've reserved enough memory for all the Kernel Zones you want to run. Joerg Moellenkamp wrote an excellent blog on how you can use it and see that it's been allocated.

At the last few events we did, we kept getting questions on this topic, and I wanted to write a small blog to point to Joerg's blog. While writing it, though, I realized I wanted to give a bit more of the backstory to hopefully give a better feel for what's going on. Hence the long intro to the link to Joerg's blog.
