Monday Oct 13, 2008

Server Virtualization - Creating IO Domains on T5440

If we refer to the System Topology diagram in my previous blog, we find that the internal disks of the T5440 are connected to PCIe-0. Hence PCIe-0 cannot be removed from the Primary (or Control) Domain. However, PCIe-1, PCIe-2, and PCIe-3 can be removed from the Primary Domain and allocated to IO Domains.

In order to create an IO domain using PCIe-1, PCIe-1 has to be removed from the Primary Domain. This causes the Primary Domain to lose its primary network interface if it has been using the on-board NICs. However, if a network card is available on PCIe-0, the primary network for the Primary Domain can be switched to the ports on that card before removing PCIe-1. If an additional network card is not available, it is still possible to remove PCIe-1 from the Primary Domain and create an IO domain (let us call it the Secondary Domain) managing the devices off PCIe-1. In such a case, the Primary Domain provides the boot-disk service to the Secondary Domain, and the Secondary Domain provides the primary network service for the Primary Domain. The pseudo-steps below outline how this can be done.

  • In the Primary Domain
    • set the number of VCPUs to 8 (this is just an example number of VCPUs)
    • set the memory to 8GB (just an example size of memory)
    • create a vdisk-server
    • remove PCIe-1 from its control
      • This would cause the Primary Domain to lose its network after reboot
    • Reboot the Primary Domain and log back into the Primary Domain from Console
    • To cause the VCPUs for the Secondary Domain to be allocated from T1 (refer to the Topology Diagram), create a dummy domain with the remaining 56 VCPUs from T0, and bind the dummy domain.
    • Associate a vdiskserverdevice as the boot-device for Secondary Domain                
    • Create the Secondary Domain
      • set the number of VCPUs to 8
      • set the memory to 8GB
      • add PCIe-1 to it
      • add the vdiskserverdevice as the vdisk for this domain
      • Bind, install-OS and boot the domain
    • Create a vswitch-device on the Secondary Domain
    • Reboot the Secondary Domain
    • Create a vnet-device for the Primary Domain associated with the above vswitch-device
    • Plumb and configure the vnet device on the Primary Domain (assuming the on-board network ports are connected to the primary network of the Data Center). Now the Primary Domain should have the primary network available.
    • Remove the dummy domain and proceed with creating other domains.
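The pseudo-steps above can be sketched as `ldm` commands. Treat this as a sketch only: the PCIe-1 device path (pci@500), the backing-disk path, the NIC name (nxge0), the sizes, and the domain names are assumptions; verify them on your system with `ldm list-devices` first.

```shell
# In the Primary (Control) Domain -- illustrative values only
ldm set-vcpu 8 primary
ldm set-memory 8G primary
ldm add-vds primary-vds0 primary          # virtual disk server
ldm remove-io pci@500 primary             # give up PCIe-1 (path is an assumption)
# Reboot the Primary Domain so the IO change takes effect, then, from the console:

# Dummy domain pins the remaining 56 VCPUs on T0 so that
# the Secondary Domain's VCPUs come from T1
ldm add-domain dummy
ldm add-vcpu 56 dummy
ldm bind dummy

# Boot disk for the Secondary Domain, served by the Primary Domain
ldm add-vdsdev /dev/dsk/c1t1d0s2 secvol@primary-vds0

# Create the Secondary (IO) Domain
ldm add-domain secondary
ldm add-vcpu 8 secondary
ldm add-memory 8G secondary
ldm add-io pci@500 secondary
ldm add-vdisk bootdisk secvol@primary-vds0 secondary
ldm bind secondary
ldm start secondary                       # install the OS over the console

# Virtual switch in the Secondary Domain, backed by an on-board NIC off PCIe-1,
# and a vnet in the Primary Domain attached to it
ldm add-vsw net-dev=nxge0 secondary-vsw0 secondary
ldm add-vnet vnet0 secondary-vsw0 primary

# Finally, release the pinned VCPUs for other domains
ldm unbind dummy
ldm remove-domain dummy
```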

With the above technique, when the Primary Domain is rebooted, the Secondary Domain may seem to pause until the Primary Domain boots back. Similarly when the Secondary Domain is rebooted, the Primary Domain's primary network may appear to freeze until the Secondary Domain comes back online. But that is far better than losing all the domains and the applications running in those domains.

Server Virtualization - Using LDOMs on T5440

The Sun Fire T5440 can have at most 4 UltraSPARC T2 Plus processors. Each processor is directly connected to one fourth of the entire system memory, with 1GB memory interleaving, and owns a PCIe Root-Complex. When fully populated with processors and memory, Solaris can see 256 CPUs and 512GB of memory. That is a lot for many applications, except for some large databases. With this class of system, it is not usually possible to consume the entire system with a single instance of most applications. But that is in fact a very good opportunity to consolidate a bunch of such applications on this system using LDOMs, thereby reducing power consumption and rack space. An example is the SugarCRM application. It is a web-based application written in PHP with a MySQL database backend. Yun Chew has written a nice blog demonstrating how to consolidate the SugarCRM application on this system using LDOMs. I can think of many such applications that can be consolidated on this and the T5140 and T5240 based systems.

In the work done by Yun referred to above, there was no need to create any IO domains; but because the T5440 has 4 PCIe Root-Complexes, it is possible to create up to 4 IO domains for applications sensitive to IO performance. Such applications, like databases, can be run in an IO domain so that the application has direct access to the physical disks. The other domains, like application server domains, can access the database over a virtual NIC. Each of the application server domains can have another virtual NIC to communicate with the external world.

The good thing about LDOMs based virtualization is that even if the Primary Domain goes down, the other domains continue to be functional. Many other virtualization technologies do not have this advantage, which is why Live Migration is very critical for those technologies.

To get the best performance out of a LDOMs based application deployment, it is important to understand the system topology a bit so that it becomes easier to determine what to place where. I have tried to create a sketch of the system topology below for reference.


When creating domains, the IO and CPU requirements of the applications that will run in the virtualized environment should be estimated. The IO performance of a virtualized 1Gb network and a virtualized disk is the same as native. But compared to native IO, virtualized IO consumes more CPU cycles, often in the range of 5%-25%, depending on the size and frequency of the IO. Hence, when doing resource planning for an LDOMs environment, a couple of points should be considered to get the best performance from the T5440 LDOMs environment.

  • Is the application CPU intensive?
    • Does it scale up with additional CPUs?
  • Is the application Disk or Network IO intensive?
    • Moderately IO-intensive applications would consume less than 50% of the maximum IO capacity of the device
  • Is the application both CPU and IO intensive?
  • How many interrupt sources would the domain need to manage?
    • PCIe based Fibre Channel HBAs normally have 2 interrupt sources.
    • PCIe based 1G network devices have either 1 or 2 interrupt sources, while 10G network devices have 8 interrupt sources.
    • Each virtualized IO device created out of vsw or vds has 1 interrupt source.

The number of VCPUs that needs to be allocated to a domain depends largely on the ability of the application to make good use of them. In addition to the VCPUs needed by the application, extra VCPUs should be allocated to handle interrupts. For optimal performance, VCPUs should be allocated to a domain in multiples of at least 4, preferably in multiples of 8 where possible.
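As a back-of-the-envelope aid, this rounding rule can be expressed as a tiny shell function. The interrupt-handling headroom passed in is an assumption you would derive from the interrupt-source count discussed above.

```shell
# Round (VCPUs needed by the app + interrupt-handling headroom)
# up to the next multiple of 8, with a floor of 8.
round_vcpus() {
  local need=$(( $1 + $2 ))
  local rounded=$(( (need + 7) / 8 * 8 ))
  [ "$rounded" -lt 8 ] && rounded=8
  echo "$rounded"
}

round_vcpus 12 2   # 12 app VCPUs + 2 for interrupts = 14 -> prints 16
round_vcpus 5 1    # 6 -> prints 8
```

If only multiples of 4 are feasible (e.g., on a heavily subscribed system), change the two 8s to 4s; the same arithmetic applies.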

In the next section I will describe how to create an IO domain with an inter-IO-domain dependency.

Server Virtualization - LDOMs

With the introduction of Chip Multi-Threading (CMT) in the SPARC Processor Family, a new sun4v based architecture was also introduced. 


This sun4v interface allows the Operating System to communicate with the hardware via a layer called the Hypervisor. The Hypervisor provides a hardware abstraction to the Operating System. The Hypervisor itself is not an Operating System; it is delivered with the platform, bundled with the firmware. This makes it possible to carve out different groups of actual hardware components and present them to the Operating System.


This combination of the Hypervisor and the sun4v based Operating System are the key enablers for LDOMs. LDOMs is supported on all UltraSPARC T1 and UltraSPARC T2 based systems. There are some nice documents about LDOMs, including discussion forums where you can join or post your questions.

LDOMs Concept

An UltraSPARC T1 processor is equipped with up to 8 cores, with 4 hardware threads (strands) per core. Each hardware thread is seen as a CPU by the Operating System. An UltraSPARC T2 processor is also equipped with up to 8 cores per chip, with 8 hardware threads per core.

When creating domains, CPUs are allocated to a domain. A CPU allocated to one domain cannot be shared with another domain. Similarly, when memory is allocated to a domain, the same memory cannot be allocated to another domain. Hence CPU and memory are partitioned across domains. However, IO devices like network cards or disks can be shared. When sharing disks, a single slice of a disk cannot be shared with multiple domains; however, different slices of a disk can be allocated to different domains. It is also possible to create large files on a mounted filesystem and make a file available to a domain as a disk.
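For example, a file-backed virtual disk can be built roughly like this. The file path, volume names, and domain name are made up for illustration:

```shell
# Create a 10GB backing file and export it as a virtual disk volume
mkfile 10g /ldoms/ldg1-disk0.img
ldm add-vdsdev /ldoms/ldg1-disk0.img ldg1-vol0@primary-vds0

# A single slice can likewise be exported to exactly one domain
ldm add-vdsdev /dev/dsk/c1t0d0s4 ldg1-vol1@primary-vds0

# Attach the file-backed volume to the guest as its disk
ldm add-vdisk vdisk0 ldg1-vol0@primary-vds0 ldg1
```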

The UltraSPARC T1 based T2000 has 2 PCIe Root-Complexes; the UltraSPARC T2 based T5120 and T5220 have 1 PCIe Root-Complex along with 2 on-chip 10 Gigabit Ethernet ports; the UltraSPARC T2 Plus based T5140 and T5240 have 2 PCIe Root-Complexes; and the T5440 has 4. It is possible to allocate a Root-Complex to a Guest Domain so that the Guest has direct access to the devices connected to that Root-Complex.

LDOMs Components

  • Primary Domain - This is the default, or first, domain that is available with a new system. Initially all system resources are allocated to this domain. This is the only domain that can be used to configure other domains. This domain is sometimes referred to as the Control Domain.
  • Service Domain - A domain that provides disk and network services to other domains. For example, if a Guest Domain makes a disk image stored in its filesystem available for booting another domain, then it can be called a Service Domain.
  • IO Domain - A domain that owns physical IO devices. When such a domain shares its devices with another domain, it can also be termed a Service Domain.
  • Guest Domain - A domain that depends on any of the above three domains for its IO services.
  • Virtual Disk Client (vdc) - A device driver component active in a Guest Domain that provides the disk view to the domain.
  • Virtual Disk Server (vds) - A device driver component active in a Service Domain that is responsible for the physical IO after receiving requests from the vdc.
  • Virtual Network Client (vnet) - Similar to vdc above, but provides a Virtual NIC service to the Guest.
  • Virtual Network Switch (vsw) - A switch implementation that communicates with vnet on one side and with the NIC device driver on the other side.
  • Virtual Console Concentrator (vcc) - Provides console access to a Guest Domain.
  • MAU - These are the on-chip Cryptographic Co-Processors. There is 1 MAU per core.

Steps for Creating a Domain

  1. Some CPU and memory resources must be removed from the Primary Domain so that they can be allocated to other domains
  2. A vcc instance needs to be created in the Primary Domain
  3. A vsw and a vds (Virtual Disk Server) instance need to be created
  4. At this point a Guest Domain can be created
    1. It should be assigned a console port (vcc)
    2. Its vdc should be associated with a Virtual Disk Service
    3. Its vnet should be associated with a vsw
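Under some assumptions, the steps above map to `ldm` commands roughly as follows. The NIC name (e1000g0), the backing-disk path, the console port, the sizes, and the domain name ldg1 are all illustrative; adjust them to your system before use.

```shell
# 1. Free up resources in the Primary Domain
ldm set-vcpu 8 primary
ldm set-memory 4G primary

# 2. Console concentrator in the Primary Domain
ldm add-vcc port-range=5000-5100 primary-vcc0 primary

# 3. Virtual switch and virtual disk server
ldm add-vsw net-dev=e1000g0 primary-vsw0 primary
ldm add-vds primary-vds0 primary

# 4. Create and wire up a Guest Domain
ldm add-domain ldg1
ldm add-vcpu 16 ldg1
ldm add-memory 8G ldg1
ldm set-vcons port=5000 ldg1                    # console port from the vcc range
ldm add-vdsdev /dev/dsk/c1t1d0s2 vol1@primary-vds0
ldm add-vdisk vdisk1 vol1@primary-vds0 ldg1     # the vdc side
ldm add-vnet vnet1 primary-vsw0 ldg1            # the vnet side
ldm bind ldg1
ldm start ldg1
```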

Tony Shoumack wrote a nice blueprint to provide detailed help with domain creation using LDOMs.

The per-core FPU of the UltraSPARC T2 and UltraSPARC T2 Plus is just a functional unit of the core. When a domain needs to execute floating point instructions, the core associated with the domain takes care of it. If the domain needs to accelerate cryptographic operations by offloading them to the on-chip Cryptographic Co-Processor, then MAUs need to be assigned to the domain.
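Assigning MAUs is a one-liner; the count and the domain name here are illustrative:

```shell
# Give the (hypothetical) dbdomain two crypto units --
# at most one MAU per core the domain owns
ldm set-mau 2 dbdomain
```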

In the next section, I will cover how to allocate devices and CPU to get the best performance.

Server Virtualization - Techniques

In my previous blog I discussed share-management of resources. Over the years, the system resources that get managed have evolved, requiring the share-management to become more complex. Not long ago, VT100 type dumb terminals connected to serial line concentrators were a popular technique to share a server system among end users. With the demand for desktop graphics, Sun invented the SunRay technology, allowing thousands of graphics desktops to be concentrated on a few servers without having graphics cards. The basic share-management software in these technologies is similar to what a multi-user Operating System provides.

With increasing demand for name-space and configuration isolation, Sun created Zones (a.k.a. Solaris Containers), an Operating System level lightweight virtualization technology. Each Zone represents a whole system with its own name-space and configuration that can differ from another Zone's. Zones share the same kernel on a given system. But a special type of Zone, called a Branded Zone, allows Solaris 8 and Solaris 9 Operating System instances to run on Solaris 10. Branded Zones created on the Solaris 10 x86 Operating Environment can also run a 32-bit Linux OS. CPU and memory resources can be shared or dedicated. A new type of scheduler, called the Fair Share Scheduler, helps maintain a balance of CPU usage among the Zones.

From the above, it is evident that for a setup to be termed Virtualized, at least some system resources must be shared, with active share-management in place. The resources are:

  • CPU - can be dedicated or shared among the Domains.
  • Memory - is normally not shared, but in case it gets shared among Domains, it can lead to performance penalties
  • IO - can also be shared or dedicated at a leaf level or an entire IO subsystem can be dedicated to a Domain.

These new sharing requirements introduce the concept of an arbitrator, which owns all the resources of the system and mediates access to them. This arbitrator is called the Hypervisor. Traditionally a CPU executes instructions either in user mode or in supervisor mode. But with multiple domains accessing the same CPU, a new mode needs to be introduced: hyper-privileged mode. This mode is assigned to the Hypervisor. The location and exact role played by the Hypervisor in allocating, dedicating, or sharing resources among domains differentiates one virtualization technology from another. Some Hypervisors are extensions to existing kernels, while other Hypervisors are part of the system firmware.

When an IO device is shared by multiple domains, a proxy mechanism is normally used. The proxy performs the actual IO on behalf of the Guest Domain. The Guest communicates with the proxy over channels, which are allocated and maintained by the Hypervisor. The actual functionality provided by the channel depends on the virtualization technology used. The Hypervisor is often also responsible for managing the IO space between the Guest Domains and the proxy. It sometimes performs the task of copying data from one IO space to another, or grants access to a piece of memory belonging to one domain or proxy to another domain or proxy so that it can relieve itself from doing the actual copy. This copy can sometimes pose extra overhead and is often the source of reduced virtualized IO performance compared to native IO performance. New features in the PCI Express subsystem allow a Guest Domain to do IO directly with the physical device. This advancement has led the virtualization technology providers to come up with two new solutions, viz. Direct-IO and IOV. I will go into the details of these later.

It is apparent from the above that the Guest Operating System needs to be modified to some extent to allow it to communicate with the proxy. When the Guest Operating System needs modification, or is made virtualization-aware, this is called Para-Virtualization. But it is also possible to emulate an entire computer system and present it to the Guest Operating System. At a minimum, if the IO subsystem is emulated, then it is possible to run a Guest Domain with an unmodified native Operating System. This is often termed Full Virtualization. Because this technique involves a lot of emulation, its performance often lags that of Para-Virtualized domains. Performance acceleration requires help from the hardware and is termed Hardware Assisted Virtualization.

In this new Virtualization space, Sun offers two solutions - xVM Server for x86 Platform and LDOMs for the SPARC  Platform.

In the next section, I will write about LDOMs. 

Server Virtualization - Concepts

Over the past couple of years, I have been visiting customers ranging from enterprise to small and mid-size web-tier companies and every customer in between, except Telco customers, in an attempt to understand how virtualization can help them. I find several use cases and some confusion.

Why Deploy Virtualization?

  • It can be deployed in a way that meets employee desktop requirements, thereby reducing the administrative overhead of managing a lot of individual systems
  • CPUs have become faster, with more cores per socket; many applications are unable to take advantage of all the cores on these powerful chips. So customers want to run multiple applications on these systems without affecting the specialized and sometimes conflicting configuration requirements of each application.
  • Data centers with a lot of client-facing servers are consuming too much power, distorting the power-throughput ratio significantly. Hence customers want to collapse these servers into a few large systems.
  • It should be possible to run old applications on newer hardware

The term Virtualization spans several technologies that the industry has offered so far. At a very basic level, I would define Virtualization technology as a way to share the resources of a single server system across multiple users. The users can be human beings and/or application software. The resources are generally the components that make up a computer, viz. CPU, memory, IO, and display/input devices. Hence, at a very basic level, the virtualization technology must actively participate in the share-management of these resources.

I should make a distinction here between Partitioning a large system into Domains (e.g., the Domains of a Sun Fire 6900), where none of the above resources are shared (also sometimes referred to as Hard Partitioning), and Server Virtualization (the term Domain is also used in this context), where at least some or all of the resources are shared, with active share-management in place. The term Domain encompasses the above set of resources with some settings, viz. OS, patches, tunables, software, that can differ from another Domain's and yet co-exist in the same physical system.

While Virtualization is just one way of doing server consolidation, the availability of so many virtualization technologies with so many options has created some misconceptions and confusion.

  • Can I really re-deploy my application in a new domain without affecting the end-user experience? The short answer is no, unless it was also possible to do so in a non-virtualized environment.
  • I can create a lot of domains in a single system, creating a situation where CPU and memory are over-subscribed. What about performance?
  • Can I move my Guest from one virtualization technology to another? The answer depends on several factors:
    • The instruction set supported by the source system and the target system should be the same, unless the virtualization technology on the target system can emulate the source system's instruction set.
    • If the Guest Operating System is unaware that it is running in a virtualized environment, then it should be possible to do such migrations.
  • I can set up or move domains around and bring up my application in no time. Sometimes that is true; it depends on how complex the IO environment is.
  • I can use the same disk device or boot image to boot all my domains. Sharing a boot image is not possible in Virtualization. Sharing a physical disk is possible, with some caveats.

Next, I plan to write about some virtualization technologies that are available today. This should help clear up some of the confusion and misconceptions.


