Best Practices - Dynamic Reconfiguration

This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly named Logical Domains).

Overview of Dynamic Reconfiguration

Oracle VM Server for SPARC supports Dynamic Reconfiguration (DR), making it possible to add resources to or remove them from a domain (virtual machine) while it is running. This is extremely useful because resources can be shifted to or from virtual machines in response to load conditions without having to reboot or interrupt running applications. For example, if an application requires more CPU capacity, you can add CPUs to improve performance, and remove them when they are no longer needed. You can even use Dynamic Resource Management (DRM) policies that automatically add CPUs to and remove them from domains based on load.
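As a rough illustration of such a policy, the sketch below wraps an ldm add-policy call in a shell function. The property names follow the ldm add-policy syntax as I understand it, but the policy name high-usage, the thresholds, the time window, and the helper name add_daytime_policy are all assumptions chosen for illustration, not values from this post. Check the ldm man page for your release before relying on it.

```shell
# Hypothetical helper: define a daytime DRM policy for one domain.
# Every value below is an illustrative assumption, not a recommendation.
add_daytime_policy() {
    # $1 = domain name, e.g. ldom1
    # Between 09:00 and 18:00, aim to keep utilization between 25% and 75%
    # by adding or removing one virtual CPU at a time, within 2..16 CPUs.
    ldm add-policy tod-begin=09:00 tod-end=18:00 \
        util-lower=25 util-upper=75 \
        vcpu-min=2 vcpu-max=16 \
        attack=1 decay=1 priority=1 \
        name=high-usage "$1"
}

# Run on the control domain:
#   add_daytime_policy ldom1
```

Defining the call as a function keeps the sketch inert until you invoke it deliberately on a control domain.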

How it works (in broad terms)

Dynamic Reconfiguration is done in coordination with Solaris, which recognises a hypervisor request to change its virtual machine configuration and responds appropriately. In essence, Solaris receives a message saying "you now have 16 more CPUs numbered 16 to 31" or "8GB more RAM starting at address X" or "here's a new network or disk device - have fun with it". These actions take very little time.

Solaris then can start using the new resource. In the case of added CPUs, that means dispatching processes and potentially binding interrupts to the new CPUs. For memory, Solaris adds the new memory pages to its "free" list and starts using them. Comparable actions occur with network and disk devices: they are recognised by Solaris and then used.

Removing is the reverse process: after receiving the DR message to free specific CPUs, Solaris unbinds interrupts assigned to the CPUs and stops dispatching process threads. That takes very little time.

primary # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    4G       1.0%  6d 22h 29m
ldom1            active     -n----  5000    16    8G       0.9%  6h 59m
primary # ldm set-core 5 ldom1
primary # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    4G       0.2%  6d 22h 29m
ldom1            active     -n----  5000    40    8G       0.1%  6h 59m
primary # ldm set-core 2 ldom1
primary # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    4G       1.0%  6d 22h 29m
ldom1            active     -n----  5000    16    8G       0.9%  6h 59m
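If you script changes like these, it can help to confirm the result rather than eyeball the listing. The sketch below pulls the VCPU column for one domain out of ldm list style output. The helper name vcpus_for is hypothetical, and it assumes the column layout shown above, where VCPU is the fifth field. The listing implies 8 hardware threads per core on this machine (5 cores gave 40 virtual CPUs, 2 cores gave 16), so the expected count after ldm set-core 5 is 40.

```shell
# Hypothetical helper, not part of the ldm CLI: print the VCPU count for
# one domain. The listing is read from standard input.
vcpus_for() {
    # $1 = domain name; field 5 is VCPU in the layout shown in this post.
    awk -v dom="$1" '$1 == dom { print $5 }'
}

# Sample listing copied from the output above (after "ldm set-core 5 ldom1").
sample='NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    4G       0.2%  6d 22h 29m
ldom1            active     -n----  5000    40    8G       0.1%  6h 59m'

vcpus=$(printf '%s\n' "$sample" | vcpus_for ldom1)
echo "ldom1 has $vcpus virtual CPUs"

# On the control domain you would pipe the live command instead:
#   ldm list | vcpus_for ldom1
```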

Memory pages are vacated by copying their contents to other memory locations and wiping them clean. Solaris may have to swap memory contents to disk if the remaining RAM isn't enough to hold all the contents. For this reason, deallocating memory can take longer on a loaded system. Even on a lightly loaded system it took 7 or 8 seconds to switch the domain below between 8GB and 24GB of RAM.

primary # ldm set-mem 24g ldom1
primary # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    4G       0.1%  6d 22h 36m
ldom1            active     -n----  5000    16    24G      0.2%  7h 6m
primary # ldm set-mem 8g ldom1
primary # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    4G       0.7%  6d 22h 37m
ldom1            active     -n----  5000    16    8G       0.3%  7h 7m
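To see how long a memory change takes on your own system, you can wrap the command in a small timer. This is only a sketch: timed is a hypothetical helper, it reports whole seconds, and it assumes a date command that understands the +%s format, which may not hold on older Solaris releases.

```shell
# Hypothetical helper, not part of ldm: run a command, report how many
# whole seconds it took, and preserve the command's exit status.
# Assumes date(1) supports the +%s format.
timed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s, exit status: $status" >&2
    return "$status"
}

# On the control domain you might run:
#   timed ldm set-mem 8g ldom1
# Harmless stand-in so the sketch runs anywhere:
timed sleep 1
```

The report goes to standard error so it does not mix with the output of the command being timed.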

What if the device is in use?

(this is the anecdote that inspired this blog post)

If CPU or memory is being removed, releasing it is pretty straightforward, using the method described above. The resources are released, and Solaris continues with less capacity. It's not as simple with a network or I/O device: you don't want to yank a device out from underneath an application that might be using it. In the following example, I've added a virtual network device to ldom1 and want to take it away, even though it's been plumbed.

primary # ldm rm-vnet vnet19  ldom1
Guest LDom returned the following reason for failing the operation:

                         Resource                                 Information
----------------------------------------------------------  -----------------------
/devices/virtual-devices@100/channel-devices@200/network@1  Network interface net1

VIO operation failed because device is being used in LDom ldom1
Failed to remove VNET instance 

That's what I call a helpful error message - telling me exactly what was wrong. In this case the problem is easily solved. I know this NIC is seen in the guest as net1 so:

ldom1 # ifconfig net1 down unplumb 

Now I can dispose of it, and even the virtual switch I had created for it:

primary # ldm rm-vnet vnet19  ldom1
primary # ldm rm-vsw primary-vsw9 

If I had to take away the device disruptively, I could have used ldm rm-vnet -f but that could disrupt whoever was using it. It's better if that can be avoided.
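If you want to automate the polite path, one option is to pick the interface name out of the failure text and hand it to whoever owns the guest. The sketch below does that for the message format shown above. The helper name in_use_interfaces is hypothetical, and the parsing assumes the message layout is stable, which this single sample cannot confirm.

```shell
# Hypothetical helper, not part of ldm: read the failure text printed by
# "ldm rm-vnet" and print the guest interface names reported as in use.
in_use_interfaces() {
    # Resource lines start with /devices/ and end with the interface name.
    awk '/^\/devices\// && /Network interface/ { print $NF }'
}

# Failure text copied from the example above.
error_text='Guest LDom returned the following reason for failing the operation:

                         Resource                                 Information
----------------------------------------------------------  -----------------------
/devices/virtual-devices@100/channel-devices@200/network@1  Network interface net1

VIO operation failed because device is being used in LDom ldom1
Failed to remove VNET instance'

for ifname in $(printf '%s\n' "$error_text" | in_use_interfaces); do
    # Only print the suggestion; the command itself must run in the guest.
    echo "in guest: ifconfig $ifname down unplumb"
done
```

The loop only prints a suggestion, since unplumbing has to happen inside the guest domain and should be a deliberate decision.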

Summary

Oracle VM Server for SPARC provides dynamic reconfiguration, which lets you modify a guest domain's CPU, memory and I/O configuration on the fly without reboot. You can add and remove resources as needed, and even automate this for CPUs by setting up resource policies.

Taking things away can be more complicated than giving, especially for devices like disks and networks that may contain application and system state or be involved in a transaction. LDoms and Solaris cooperate to coordinate resource allocation and de-allocation in a safe and effective way. For best practices, use dynamic reconfiguration to make the best use of your system's resources.

Comments:

Hi,
it sounds too good to be true.... I mean that in 90% of the real life situations this DR depends on the running app... FOE.
If we have a running oracle 10 db on a ldom and if we take some CPU and memory?... I think that the db will crash! Am I right?

Posted by Del on September 01, 2012 at 06:53 AM MST #

Hi Del,

I guess in the ultimate analysis *everything* depends on the app, doesn't it? If, for example, an app won't benefit from extra RAM or extra CPUs (perhaps it only has a fixed number of runnable process threads) then adding more won't help - whether in DR situation or not.

Now, taking resources away from a running application can be complicated by the fact that some sophisticated applications tune themselves to use all the resources made available to them - and may or may not be programmed to respond to subsequent changes in configuration. (Of course, taking CPUs away from an idle application is quite safe.) That's in general, but taking the example of Oracle database under Solaris: if you are using Intimate Shared Memory (ISM), you are (this is my understanding - I'm not an expert in Oracle DB) locking/pinning pages - so you shouldn't be able to remove them. If you have correctly configured Dynamic Intimate Shared Memory (DISM) you may be able to reduce memory. Again, not my area of expertise, so please consult the relevant documents. Also, it might be that there are things 10g doesn't do in this area that 11g handles.

Posted by Jeff on September 03, 2012 at 12:41 PM MST #

Hi Jeff,

You are absolutely right... *everything* depends on the app ;). In my opinion when we are talking about dynamic resource reallocation (especially for resource removing) we all must think of the whole system (not only the OS, but the app running above, too). :))

Posted by Del on September 04, 2012 at 02:18 AM MST #

Hi Del,

It looks like we're in agreement, and thanks for the dialogue! The rule is "don't try to DR out a resource that is in use". That plays in neatly with the blog example where trying to remove a virtual network device initially failed because it was in use. Any in-use resource has to be released from an application before it can be removed.

For CPUs: easy, since idle applications don't use CPU cycles. For devices: must release them. For memory: at least two different cases: with applications that simply touch memory and then go idle, removing memory might cause swap-out operations if their memory footprints don't fit in the pages remaining after the DR operation. But if applications lock pages in memory, those pages have to be unlocked first.

Posted by Jeff Savit on September 04, 2012 at 07:44 AM MST #

I just learned that if you start Oracle Database without Instance Caging, then changing the number of CPUs can be automatically detected because the database periodically asks the OS how many CPUs it has, and adjusts accordingly. I repeat my disclaimer that I'm not an authority, so please don't ask me for details on DB behavior :-)

Posted by Jeff Savit on September 12, 2012 at 08:26 AM MST #

About

jsavit
