Balloon Thread Code


The following is a quick tour of the balloon thread, which I gave an introduction to last Friday. I'll be quoting bits of code, but to see the whole thing, you should download the source tarball (see this page). The balloon thread is in $SRC/uts/i86xpv/os/balloon.c.


Some terms before we begin:

  • mfn: machine frame number, or the hardware page number of a page
  • pfn: pseudo-physical frame number, or the page number of a page as presented to a domain. The p_pagenum value in the page_t structure holds the pfn.
  • reservation: the amount of memory that is currently allocated to a domain
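To make the pfn/mfn distinction concrete, here is a toy model of the per-domain translation table that the balloon code maintains. All names here are hypothetical stand-ins; the real kernel keeps a machine-wide mfn list and updates it through reassign_pfn():

```c
#include <assert.h>
#include <stdint.h>

#define	TOY_INVALID_MFN	((uint64_t)-1)	/* pfn currently has no backing mfn */
#define	TOY_NPAGES	16

/* Toy p2m table: one mfn slot per pfn. */
static uint64_t toy_p2m[TOY_NPAGES];

/* Establish, or break (mfn == TOY_INVALID_MFN), a pfn<->mfn mapping. */
static void
toy_reassign_pfn(uint64_t pfn, uint64_t mfn)
{
	toy_p2m[pfn] = mfn;
}

/* Translate a domain-visible pfn to the hardware mfn behind it. */
static uint64_t
toy_pfn_to_mfn(uint64_t pfn)
{
	return (toy_p2m[pfn]);
}
```

When the balloon gives a page back to the hypervisor, its slot goes invalid; when the hypervisor hands a page over, the slot is filled in again.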


Most of the code in balloon_init() simply initializes some variables. At the end, we register a Xenbus watch, which will execute once Xenbus is up and running. On domU, this will happen immediately, but on dom0, it must wait until all the userland daemons are running.

In balloon_config_watch(), we register a watch on the domain's memory/target location in xenbus. This location holds the domain's target memory reservation. When it changes, the watch fires and executes the given function, balloon_handler(). The balloon worker thread is also created at this time.

Handling Balloon Events

Once the target changes, in balloon_handler(), we check the new target. First, we verify that the target makes sense. I'll cover a bit more of the policy here later. If it makes sense, we set the new_target value in our global structure (protected by a mutex, of course), and signal our cv.
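The handler side of that handshake is the classic "publish under a lock, then signal" pattern. Here is a userland sketch using pthreads; the kernel uses mutex_enter()/cv_signal() instead, and the names below are illustrative:

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Toy model of the handler/worker handshake. The handler publishes the
 * new target under the mutex, then wakes the worker thread, which is
 * blocked in a cv_wait()-style call on the same condition variable.
 */
static pthread_mutex_t	toy_bln_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t	toy_bln_cv = PTHREAD_COND_INITIALIZER;
static uint64_t		toy_bln_new_target;

static void
toy_balloon_handler(uint64_t new_target_pages)
{
	pthread_mutex_lock(&toy_bln_mutex);
	toy_bln_new_target = new_target_pages;	/* publish the new target */
	pthread_cond_signal(&toy_bln_cv);	/* wake the worker thread */
	pthread_mutex_unlock(&toy_bln_mutex);
}
```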

That signal is picked up in balloon_worker_thread(), where this code appears:

                if (bln_stats.bln_new_target != bln_stats.bln_current_pages) {
                        /*
                         * We weren't able to fully complete the request
                         * last time through, so try again.
                         */
                        (void) cv_timedwait(&bln_cv, &bln_mutex,
                            lbolt + (bln_wait_sec * hz));
                } else {
                        cv_wait(&bln_cv, &bln_mutex);

If we're not at our target, we can wait for some amount of time. This was added when I first wrote this code, because I was worried about overloading the domain or hypervisor, but it turned out to be unnecessary. So, bln_wait_sec is 0 by default. It can be changed in /etc/system if you wish to slow down the balloon thread.

Next, if the reservation shrunk, we call balloon_dec_reservation(). If it grew, we call balloon_inc_reservation().
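In other words, the worker's dispatch step boils down to a single comparison. A minimal sketch (names hypothetical):

```c
#include <stdint.h>

typedef enum {
	TOY_BLN_NONE,	/* already at target */
	TOY_BLN_SHRINK,	/* call balloon_dec_reservation() */
	TOY_BLN_GROW	/* call balloon_inc_reservation() */
} toy_bln_action_t;

/* Decide which direction, if any, the balloon needs to move. */
static toy_bln_action_t
toy_balloon_direction(uint64_t current_pages, uint64_t target_pages)
{
	if (target_pages < current_pages)
		return (TOY_BLN_SHRINK);
	if (target_pages > current_pages)
		return (TOY_BLN_GROW);
	return (TOY_BLN_NONE);
}
```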

Decreasing the memory reservation

The first significant thing we do in balloon_dec_reservation() is call page_resv(), which makes sure we have enough memory in the system to give away. Next, we need to pick some pages to give away. For dom0, we want to keep low-mfn pages available for driver DMA, so we use page_get_high_mfn() to pick a page. This finds the unused page with the highest-numbered mfn. If it fails, we try to free up some memory with kmem_reap() and try again; if that also fails, we give up. When a try succeeds, we call balloon_page_add(), which adds the now-invalid page_t structure to a linked list, in case we need it later.
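The "try, reap, retry" pattern above can be sketched like this. This is a toy model with made-up names; the real code calls page_get_high_mfn() and falls back to kmem_reap():

```c
#include <stddef.h>

static int toy_free_pages;	/* pages available to give away */
static int toy_reap_count;	/* how many times we had to reap */

/* Stand-in for page_get_high_mfn(): NULL means no page was available. */
static void *
toy_page_get(void)
{
	if (toy_free_pages <= 0)
		return (NULL);
	toy_free_pages--;
	return ((void *)0x1);	/* stand-in for a page_t pointer */
}

/* Stand-in for kmem_reap(): pretend reaping freed up one page. */
static void
toy_kmem_reap(void)
{
	toy_reap_count++;
	toy_free_pages++;
}

/* Grab a page, reaping and retrying once before giving up. */
static void *
toy_grab_page(void)
{
	void *pp = toy_page_get();

	if (pp == NULL) {
		toy_kmem_reap();	/* free up some memory ... */
		pp = toy_page_get();	/* ... and try one more time */
	}
	return (pp);	/* NULL: give up on this page */
}
```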

Once that's done, we loop through all pages, calling reassign_pfn() to break the mfn<->pfn mapping. We then call balloon_free_pages(), which will zero the pages and return the pages to the hypervisor.

Increasing the memory reservation

This is actually more complicated than decreasing memory. To keep things simple, I'll first assume that we're on a system that has already ballooned down, and that we're just returning to the original memory reservation.

In balloon_inc_reservation(), we first do a balloon_alloc_pages(), which simply does a hypercall to get the list of mfns from the hypervisor. We then call balloon_page_sub() to get a page_t structure from the list we created earlier. We add it to a new, local list contained by new_list_front/new_list_back. The hypervisor gave us an mfn, and we have the pfn from the page_t structure, so we can now call reassign_pfn() to map the two. We can then loop through new_list_front, calling page_free() to release the page for use by the rest of the system. Finally, we call page_unresv() to update some VM counters.
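The spare-page list that balloon_page_add() and balloon_page_sub() manage is a simple singly linked list. A toy version follows; the names are hypothetical, and the real code links actual page_t structures rather than a dedicated node type:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a page_t kept on the balloon's spare list. */
typedef struct toy_page {
	uint64_t	tp_pfn;		/* the page's pfn */
	struct toy_page	*tp_next;	/* next spare page */
} toy_page_t;

static toy_page_t *toy_bln_spare_list;

/* Modeled on balloon_page_add(): stash a now-invalid page_t. */
static void
toy_balloon_page_add(toy_page_t *pp)
{
	pp->tp_next = toy_bln_spare_list;
	toy_bln_spare_list = pp;
}

/* Modeled on balloon_page_sub(): pop a spare page_t, if any. */
static toy_page_t *
toy_balloon_page_sub(void)
{
	toy_page_t *pp = toy_bln_spare_list;

	if (pp != NULL)
		toy_bln_spare_list = pp->tp_next;
	return (pp);	/* NULL: we must grow beyond the boot-time size */
}
```

A NULL return from the sub routine is exactly the "no prepared page_t" case the next section deals with.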

Increasing the memory reservation beyond the original amount

Here's where things get tricky. What happens if we have no prepared page_t structures to return from balloon_page_sub()? That means we're growing beyond the amount of memory we had at boot. balloon_init_new_pages() was written to handle setting up all the structures we need for new memory. Thanks to Mike Corcoran's insistence during code review, I feel it's pretty well commented, so I'll let it stand on its own, with just one additional explanation.

I've mentioned reassign_pfn() several times. That function does more than updating our array mapping pfns to mfns. It also updates the kernel physical mappings (kpm). To explain kpm mappings, I'll quote usr/src/uts/common/vm/seg_kpm.c:

 * This driver delivers along with the hat_kpm* interfaces an alternative
 * mechanism for kernel mappings within the 64-bit Solaris operating system,
 * which allows the mapping of all physical memory into the kernel address
 * space at once.
 * Segkpm mappings have also very low overhead and large pages are used
 * (when possible) to minimize the TLB and TSB footprint. It is also
 * extentable for other than Sparc architectures (e.g. AMD64). Main
 * advantage is the avoidance of the TLB-shootdown X-calls, which are
 * normally needed when a kernel (global) mapping has to be removed.

kpm mappings are set up at boot. In the balloon thread, if we're dealing with pre-existing pages, xen_kpm_page() calls into the hypervisor to maintain those mappings. However, if we're trying to expand beyond our original reservation, we need to set up new mappings. This can actually require several pages to get all the page tables set up for the new mapping. If we're ballooning up, that's probably because we don't have spare pages for these page tables, which means we may have some trouble.

The process that occurs in this situation is called page stealing. We need a pagetable page, so we'll pick a pagetable page from some userland process, kick out its current mappings, and reclaim that page to set up a new kernel pagetable. That means the userland process will pagefault the next time it accesses anything through that pagetable. To handle that pagefault, we need another pagetable, so some other pagetable page gets stolen. This goes on and on, and is not a situation you want your machine running in.

Unfortunately, I don't think I can have this problem fixed before we want to commit our changes to Nevada, so the fix will be left until after putback. That means we have some currently-dead code in there, but I hope to enable it soon.

So how do we make sure we don't go above our original reservation? We set the original allocation in bln_stats.bln_max_pages, and use this code in balloon_handler():

        static uchar_t warning_cnt = 0;
        if (new_target_pages > bln_stats.bln_max_pages) {
                DTRACE_PROBE2(balloon__target__too__large, pgcnt_t,
                    new_target_pages, pgcnt_t, bln_stats.bln_max_pages);
                if (!DOMAIN_IS_INITDOMAIN(xen_info) || warning_cnt != 0) {
                        cmn_err(CE_WARN, "New balloon target (0x%lx pages) is "
                            "larger than original memory size (0x%lx pages). "
                            "Ballooning beyond original memory size is not "
                            "allowed.",
                            new_target_pages, bln_stats.bln_max_pages);
                }
                warning_cnt = 1;
                bln_stats.bln_new_target = bln_stats.bln_max_pages;
        } else {
                bln_stats.bln_new_target = new_target_pages;
        }

DOMAIN_IS_INITDOMAIN() is true when executed on dom0, and false otherwise. You may ask why we don't issue the warning on the first attempt on dom0. Well, if you don't give dom0 a dom0_mem kernel option, it gets most of the memory at boot, but Xen uses the balloon thread to give dom0 a bit more memory late in boot. Since we can't add that memory, and we don't want to issue a warning on every boot, we just skip the first warning.


Hopefully, I didn't bore everyone to death with the above :-) If I need to explain something better, please post a comment below.


So why is it that I can drop a CPU/memory board into a E2900 and Solaris will dynamically use those resources, but it's not possible to virtually do the same to a Xen domU?

Posted by Me on July 24, 2007 at 01:18 AM PDT #

Because sparc has a 10+ year head start on us :) The process will be similar to the sparc DR code in usr/src/uts/common/os/mem_config.c, but there are enough subtle differences that we just can't copy that code and paste it into balloon.c

Posted by Ryan Scott on July 24, 2007 at 04:26 AM PDT #
