Tuesday, May 20, 2014

Improved vDisk Performance for LDoms

In all the LDoms workshops I've done over the past years, I've always cautioned customers to keep their expectations within reasonable limits when it comes to virtual IO.  And I'll not stop doing that today.  Virtual IO will always come at a certain cost, because of the additional work necessary to translate physical IOs to the virtual world.  Until we invent time travel, this will always take some additional time.  But there's some good news about this, too:

First, in many cases the overhead involved in virtualizing IO isn't that much - the LDom implementation is very efficient.  And in many of these cases, it doesn't hurt, simply because the workload involved doesn't care and virtual IO is fast enough.

Second, there are good ways to configure virtual IO, and not so good ways.  If you stick to the good ways (which I previously discussed here), you'll increase the number of cases where virtual IO is more than just good enough. 

But of course, there are always those other cases where it just isn't.  But there's more good news, too:

For virtualized networking, we introduced a new implementation utilizing large segment offload (LSO) and some other techniques to increase throughput and reduce latency to a point where virtual networking has gone away as a reason for performance issues.  That was in LDoms release 3.1.  Now we're introducing a similar enhancement for virtual disk.

When we talk about disk IO and performance, the most important configuration best practice is to spread IO load to multiple LUNs.  This has always been the case, long before we started to even think about virtualization.  The reason for this is the limited number of IOPS a single LUN will deliver.  Whether that LUN is a single physical disk or a volume in a more sophisticated disk array doesn't matter.  IOPS delivered by one LUN are limited, and IOs will queue up in this LUN's queue in a very sequential manner.  A single physical disk might deliver 150 IOPS, perhaps 300 IOPS.  A SAN LUN with a strong array in the backend might deliver 5000 IOPS or a little more.  But that isn't enough, and has never been.  Disk striping of any kind was invented to solve this problem.  And virtualization of both servers and storage doesn't change the overall picture.  Which means that in LDoms, the best practice has always been to configure several LUNs, which means several vdisks, into a single guest system.  This often provided the required IO performance, but there were quite a few cases where this just wasn't good enough and people had to move back to physical IO.  Of course, there are several ways to provide physical IO and still virtualize using LDoms, but the situation was not ideal. 
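To illustrate, here's a minimal sketch of what that looks like in practice - the device paths, the vds name "primary-vds0" and the guest name "mars" are all made up for this example:

root@sun # for i in 0 1 2 3; do
> ldm add-vdsdev /dev/rdsk/c0t5000CCA0156A8E5${i}d0s2 lun${i}@primary-vds0
> ldm add-vdisk disk${i} lun${i}@primary-vds0 mars
> done

Inside the guest, you'd then stripe across these four vdisks, for example with ZFS or SVM.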

With the release of Solaris 11.1 SRU 19 (and a Solaris 10 patch shortly afterwards) we are introducing a new implementation of the vdisk/vds software stack, which significantly improves both latency and throughput of virtual disk IO.  The improvement can best be seen in the graphs below.

This first graph shows the overall number of IOPS during a performance test, comparing bare metal with the old and the new vdisk implementation. As you can see, the new implementation delivers essentially the same performance as bare metal, with a variation that might as well be statistical deviation. Note that these tests were run on a total of 28 SAN LUNs, so please don't expect a single LUN to deliver 150k IOPS anytime soon :-) The improvement over the old implementation is significant, with differences of up to 55% in some cases. Again, note that running only a single stream of IOs against a single LUN will not show as much of an improvement as running multiple streams (denoted as threads in the graphs). This is because parts of the new implementation have focused on de-serializing the IO infrastructure, something you'll not notice if you run single-threaded IO streams. But then, most IO-hungry applications issue multiple IOs.  Likewise, if your storage backend can't provide this kind of performance (perhaps because you're testing on a single, internal disk?), don't expect much change!

So we know that throughput has been fixed (with 150k IOPS and 1.1 GB/sec of virtual IO in this test, I believe I can safely say so). But what about IO latency? This next graph shows a similar improvement here:

Again, response time (or service time) with the new implementation is very similar to what you get from bare metal.  The maximum gap is in the 2-thread case, with less than 4% difference between virtual IO and bare metal.  Close enough to actually start talking about zero overhead IO (at least as far as IO performance is concerned).  Talking about overhead:  I sometimes call the overhead involved in virtualization the "Virtualization Tax" - the resources you invest in virtualization itself, or, in other words, the performance (or response time) you lose because of virtualization.  In the case of LDoms disk IO, we've just seen a significant reduction in virtualization taxes:

The last graph shows how much higher the response time for virtual disk IO was with the old implementation, and how much of that we've been given back by this charming piece of engineering in the new implementation. Where we paid up to 55% of virtualization tax before, we're now down to 4% or less. A big "Thank you!" to engineering!

Of course, there's always a little disclaimer involved:  Your mileage will vary.  The results I show here were obtained on 28 LUNs coming from some kind of FC infrastructure.  The tests were done using vdbench with a read/write mix of 60%/40%, running from 2 to 20 threads doing random IO.  While this is quite a challenging load for any IO subsystem and represents the load pattern that showed the highest virtualization tax with the old implementation, real-world workloads might not see the same improvements.  Although I am very optimistic that they will be similar.
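In case you want to run a similar test yourself, a vdbench parameter file for this kind of load pattern might look roughly like this (a sketch only - the LUN paths are placeholders, and this is not the actual benchmark configuration):

# 60/40 random read/write mix, scaling from 2 to 20 threads
sd=sd1,lun=/dev/rdsk/c0t5000CCA0156A8E55d0s2
sd=sd2,lun=/dev/rdsk/c0t5000CCA0156A8E56d0s2
wd=wd1,sd=sd*,xfersize=8k,rdpct=60,seekpct=100
rd=run1,wd=wd1,iorate=max,elapsed=300,interval=5,forthreads=(2,4,8,16,20)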

In conclusion, with the new, improved virtual networking and virtual disk IO that are now available, the range of applications that can safely be run on fully virtualized IO has been expanded significantly.  This is in line with the expectations I often find in customer workshops, where high end performance is naturally expected from SPARC systems under all circumstances.

Before I close, here's how to use this new implementation:

  • Update to Solaris 11.1 SRU 19 in
    • all guest domains that want to use the new implementation
    • all IO domains that provide virtual disks to these guests
  • This will also update LDoms Manager to 3.1.1.
  • If only one side of the pair (guest|IO domain) is updated, virtual IO will continue to work using the old implementation.
  • A patch for Solaris 10 will be available shortly.
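To verify that both sides are ready, something like the following should do (a sketch - the exact version strings will differ):

root@sun # pkg info entire | grep Version    # look for SRU 19 or later
root@sun # ldm -V                            # LDoms Manager 3.1.1 or later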

Update 2014-06-16: Patch 150400-13 has now been released for Solaris 10.  See the Readme for details.

Thursday, March 27, 2014

A few Thoughts about Single Thread Performance



Monday, February 24, 2014

What's up with LDoms: Part 8 - Physical IO

[Diagram: Virtual IO Setup]

Finally finding some time to continue this blog series...  And starting the new year with a new chapter, for which I hope to write several sections: physical IO options for LDoms and what you can do with them.  In all previous sections, we talked about virtual IO and how to deal with it.  The diagram at the right shows the general architecture of such virtual IO configurations.  However, there's much more to IO than that.

From an architectural point of view, the primary task of the SPARC hypervisor is partitioning the system.  The hypervisor isn't usually very active - all it does is assign ownership of some parts of the hardware (CPU, memory, IO resources) to a domain, build a virtual machine from these components and finally start OpenBoot in that virtual machine.  After that, the hypervisor essentially steps aside.  Only if the IO components are virtual do we need hypervisor support.  But those IO components could also be physical.  Actually, that is the more "natural" option, if you like.  So let's revisit the creation of a domain:

We always start by assigning CPU and memory, in some very simple steps:

root@sun:~# ldm create mars
root@sun:~# ldm set-memory 8g mars
root@sun:~# ldm set-core 8 mars

If we now bound and started the domain, we would have OpenBoot running and we could connect using the virtual console.  Of course, since this domain doesn't have any IO devices, we couldn't yet do anything particularly useful with it.  We want to add physical IO devices - but where are they?

To begin with, all physical components are owned by the primary domain.  This is the same for IO devices, just like it is for CPU and memory.  So just like we need to remove some CPU and memory from the primary domain in order to assign these to other domains, we will have to remove some IO from the primary if we want to assign it to another domain.  A general inventory of available IO resources can be obtained with the "ldm ls-io" command:

root@sun:~# ldm ls-io
NAME                                      TYPE   BUS      DOMAIN   STATUS  
----                                      ----   ---      ------   ------  
pci_0                                     BUS    pci_0    primary          
pci_1                                     BUS    pci_1    primary          
pci_2                                     BUS    pci_2    primary          
pci_3                                     BUS    pci_3    primary          
/SYS/MB/PCIE1                             PCIE   pci_0    primary  EMP     
/SYS/MB/SASHBA0                           PCIE   pci_0    primary  OCC
/SYS/MB/NET0                              PCIE   pci_0    primary  OCC     
/SYS/MB/PCIE5                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE6                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE7                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE2                             PCIE   pci_2    primary  EMP     
/SYS/MB/PCIE3                             PCIE   pci_2    primary  OCC     
/SYS/MB/PCIE4                             PCIE   pci_2    primary  EMP     
/SYS/MB/PCIE8                             PCIE   pci_3    primary  EMP     
/SYS/MB/SASHBA1                           PCIE   pci_3    primary  OCC     
/SYS/MB/NET2                              PCIE   pci_3    primary  OCC     
/SYS/MB/NET0/IOVNET.PF0                   PF     pci_0    primary          
/SYS/MB/NET0/IOVNET.PF1                   PF     pci_0    primary          
/SYS/MB/NET2/IOVNET.PF0                   PF     pci_3    primary          
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_3    primary

The output of this command will of course vary greatly, depending on the type of system you have.  The above example is from a T5-2.  As you can see, there are several types of IO resources.  Specifically, there are

  • BUS
    This is a whole PCI bus, which means everything controlled by a single PCI control unit, also called a PCI root complex.  It typically contains several PCI slots and possibly some end point devices like SAS or network controllers.
  • PCIE
    This is either a single PCIe slot or an endpoint device.  If it's a slot, its name corresponds to the slot number you will find imprinted on the system chassis.  It is controlled by the root complex listed in the "BUS" column.  In the above example, you can see that some slots are empty, while others are occupied.  An endpoint device would be something like a SAS HBA or network controller, for example "/SYS/MB/SASHBA0" or "/SYS/MB/NET2".  Both of these typically control more than one actual device: SASHBA0, for example, controls 4 internal disks and NET2 controls 2 internal network ports.
  • PF
    This is a SR-IOV Physical Function - usually an endpoint device like a network port which is capable of PCI virtualization.  We will cover SR-IOV in a later section of this blog.

All of these devices are available for assignment.  Right now, they are all owned by the primary domain.  We will now release some of them from the primary domain and assign them to a different domain.  Unfortunately, this is not a dynamic operation, so we will have to reboot the control domain (more precisely, the affected domains) once to complete this.

root@sun:~# ldm start-reconf primary
root@sun:~# ldm rm-io pci_3 primary
root@sun:~# reboot
[ wait for the system to come back up ]
root@sun:~# ldm add-io pci_3 mars
root@sun:~# ldm bind mars

With the removal of pci_3, we also removed PCIE8, SASHBA1 and NET2 from the primary domain and added all three to mars.  Mars will now have direct, exclusive access to all the disks controlled by SASHBA1, all the network ports on NET2 and whatever we choose to install in PCIe slot 8.  Since in this particular example, mars has access to internal disk and network, it can boot and communicate using these internal devices.  It does not depend on the primary domain for any of this.  Once started, we could actually shut down the primary domain.  (Note that the primary is usually the home of vntsd, the console service.  While we don't need this for running or rebooting mars, we do need it in case mars falls back to OBP or single-user.)

[Diagram: Root Domain Setup]

Mars now owns its own PCIe root complex.  Because of this, we call mars a root domain.  The diagram on the right shows the general architecture.  Compare this to the diagram above!  Root domains are truly independent partitions of a SPARC system, very similar in functionality to Dynamic System Domains from the E10k, E25k or M9000 days (or Physical Domains, as they're now called).  They own their own CPU, memory and physical IO.  They can be booted, run and rebooted independently of any other domain.  Any failure in another domain does not affect them.  Of course, we have plenty of shared components: A root domain might share a mainboard, a part of a CPU (mars, for example, only has 8 cores...), some memory modules, etc. with other domains.  Any failure in a shared component will of course affect all the domains sharing that component; this is different in Physical Domains, where there are significantly fewer shared components.  But beyond this, root domains have a level of isolation very similar to that of Physical Domains.

Comparing root domains (which are the most general form of physical IO in LDoms) with virtual IO, here are some pros and cons:

Pros:

  • Root domains are fully independent of all other domains (with the exception of console access, but this is a minor limitation).
  • Root domains have zero overhead in IO - they have no virtualization overhead whatsoever.
  • Root domains, because they don't use virtual IO, are not limited to disk and network, but can also attach to tape, tape libraries or any other generic IO device supported in their PCIe slots.

Cons:

  • Root domains are limited in number.  You can only create as many root domains as you have PCIe root complexes available.  In current T5 and M5/6 systems, that's two per CPU socket.
  • Root domains can not be live migrated.  Because they own real IO hardware (with all those nasty little buffers, registers and FIFOs), they can not be moved to another chassis.

Because of these different characteristics, root domains are typically used for applications that tend to be more static, have higher IO requirements and/or larger CPU and memory footprints.  Domains with virtual IO, on the other hand, are typically used for the mass of smaller applications with lower IO requirements.  Note that "higher" and "lower" are relative terms - LDoms virtual IO is quite powerful.

This is the end of the first part of the physical IO section - I'll cover some additional options next time.

Thursday, July 4, 2013

What's up with LDoms: Part 7 - Layered Virtual Networking

Back for another article about LDoms - today we'll cover some tricky networking options that come up if you want to run Solaris 11 zones in LDom guest systems.  So what's the problem?

[Diagram: MAC Tables in an LDom system]

Let's look at what happens with MAC addresses when you create a guest system with a single vnet network device.  By default, the LDoms Manager selects a MAC address for the new vnet device.  This MAC address is managed in the vswitch, and ethernet packets from and to that MAC address can flow between the vnet device, the vswitch and the outside world.  The ethernet switch on the outside will learn about this new MAC address, too.  Of course, if you assign a MAC address manually, this works the same way.  This situation is shown in the diagram at the right.  The important thing to note here is that the vnet device in the guest system will have exactly one MAC address, and no "spare slots" for additional addresses.

Add zones into the picture.  With Solaris 10, the situation is simple.  The default behaviour will be a "shared IP" zone, where traffic from the non-global zone will use the IP (and thus ethernet) stack from the global zone.  No additional MAC addresses required.  Since you don't have further "physical" interfaces, there's no temptation to use "exclusive IP" for that zone, except if you'd use a tagged VLAN interface.  But again, this wouldn't need another MAC address.


[Diagram: MAC Tables in previous versions]

With Solaris 11, this changes fundamentally.  Solaris 11, by default, will create a so-called "anet" device for any new zone.  This device is created using the new Solaris 11 network stack and is simply a virtual NIC.  As such, it will have a MAC address.  The default behaviour is to generate a random MAC address.  However, this random MAC address will not be known to the vswitch in the IO domain or to the vnet device in the global zone, and starting such a zone will fail.


[Diagram: MAC Tables in version 3.0.0.2]

The solution is to allow the vnet device of the LDoms guest to provide more than one MAC address, similar to typical physical NICs, which support numerous MAC addresses in "slots" that they manage.  This feature was added to Oracle VM Server for SPARC in version 3.0.0.2.  Jeff Savit wrote about it in his blog, showing a nice example of how things fail without this feature, and how they work with it.  Of course, the same solution will also work if your global zone uses vnics for purposes other than zones.

To make this work, you need to do two things:

  1. Configure the vnet device to have more than one MAC address.  This is done using the new option "alt-mac-addrs" with either ldm add-vnet or ldm set-vnet.  You can either provide manually selected MAC addresses here, or rely on LDoms Manager to use its MAC address selection algorithm to provide them (see the sketch after this list).
  2. Configure the zone to use the "auto" option instead of "random" for selecting a MAC address.  This will cause the zone to query the NIC for available MAC addresses instead of coming up with one and making the NIC accept it.
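In a nutshell, the two steps might look like this - a sketch, assuming a vnet device "admin-net" for guest "mars" and a zone called "zone1" with an anet linked to net0:

root@sun # ldm set-vnet alt-mac-addrs=auto,auto,auto admin-net mars
root@mars # zonecfg -z zone1
zonecfg:zone1> select anet linkname=net0
zonecfg:zone1:anet> set mac-address=auto
zonecfg:zone1:anet> end
zonecfg:zone1> commit

Each "auto" asks LDoms Manager for one additional MAC address - provide one per vnic or zone you intend to run.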

I will not go into all the details of how this is configured, as this is very nicely covered by Jeff's blog entry already.  I do want to add that you might see similar issues with layered virtual networking in other virtualization solutions:  Running Solaris 11 vnics or zones with exclusive IP in VirtualBox, OVM x86 or VMware will show the very same behaviour.  I don't know if/when these technologies will provide a solution similar to what we now have with LDoms.

Tuesday, June 18, 2013

A closer look at the new T5 TPC-H result

You've probably all seen the new TPC-H benchmark result for the SPARC T5-4 submitted to TPC on June 7.  Our benchmark guys over at "BestPerf" have already pointed out the major takeaways from the result.  However, I believe there's more to make note of.

Scalability

TPC doesn't promote the comparison of TPC-H results with different storage sizes.  So let's just look at the 3000GB results:

  • SPARC T4-4 with 4 CPUs (that's 32 cores at 3.0 GHz) delivers 205,792 QphH.
  • SPARC T5-4 with 4 CPUs (that's 64 cores at 3.6 GHz) delivers 409,721 QphH.

That's just a little short of 100% scalability, if you'd expect a doubling of cores to deliver twice the result.  Of course, one could expect to see a factor of 2.4, taking the increased clockrate into account.  Since the TPC does not permit estimates and other "number games" with TPC results, I'll leave all the arithmetic to you.  But let's look at some more details that might offer an explanation.

Storage

The report on BestPerf, as well as the full disclosure report, provides some interesting insight into the storage configuration.  For the SPARC T4-4 run, they had used 12 2540-M2 arrays, each delivering around 1.5 GB/s, for a total of 18 GB/s.  These were obviously directly connected to the 24 8GBit FC ports of the SPARC T4-4, using two cables per storage array.  Given the 8GBit ports of the 2540-M2, this setup would be good for a theoretical maximum of 2GB/sec per array.  With 1.5GB/sec actual throughput, they were pretty much maxed out.

In the SPARC T5-4 run, they report twice the number of disks (via expansion trays for the 2540-M2 arrays) for a total of 33GB/s peak throughput, which isn't quite 2x the number achieved on the SPARC T4-4.  To actually reach 2x the throughput (36GB/s), each array would have had to deliver 3 GB/sec over its 4 8GBit ports.  The FDR only lists 12 dual-port FC HBAs, which explains the use of Brocade FC switches: Connecting all 4 8GBit ports of the storage arrays and using the FC switch to bundle that into 24 16GBit HBA ports.  This delivers the full 48x8GBit FC bandwidth of the arrays to the 24 FC ports of the server.  Again, the theoretical maximum of 4 8GBit ports on each storage array would be 4 GB/sec, but considering all the protocol and "reality overhead", the 2.75 GB/sec they actually delivered isn't bad at all.  Given this, reaching twice the overall benchmark performance is good.  And a possible explanation for not going all the way to 2.4x. Of course, other factors like software scalability might also play a role here.

By the way - neither the SPARC T4-4 nor the SPARC T5-4 used any flash in these benchmarks. 

Competition

Ever since the T4s are on the market, our competitors have done their best to assure everyone that the SPARC core still lacks in performance, and that large caches and high clockrates are the only key to real server performance.  Now, when I look at public TPC-H results, I see this:

TPC-H @3000GB, Non-Clustered Systems

System                  CPU                          Chips/Cores – Memory   QphH
SPARC T5-4              3.6 GHz SPARC T5             4/64 – 2048 GB         409,721.8
SPARC T4-4              3.0 GHz SPARC T4             4/32 – 1024 GB         205,792.0
IBM Power 780           4.1 GHz POWER7               8/32 – 1024 GB         192,001.1
HP ProLiant DL980 G7    2.27 GHz Intel Xeon X7560    8/64 – 512 GB          162,601.7

So, in short, the 32-core SPARC T4-4 (3.0 GHz, 4MB L3 cache) delivers more QphH@3000GB than IBM does with their 32-core POWER7 (4.1 GHz, 32MB L3 cache), and also more than HP with the 64-core Intel Xeon system (2.27 GHz, 24MB L3 cache).  So where exactly is SPARC lacking?

Right, one could argue that both competing results aren't exactly new.  So let's do some speculation:

IBM's current Performance Report lists the above mentioned IBM Power 780 with an rPerf value of 425.5.  A successor to the above Power 780 with P7+ CPUs would be the Power 780+ with 64 cores, which is available at 3.72 GHz.  It is listed with an rPerf value of 690.1, which is 1.62x more.  So based on IBM's own performance estimates, and assuming that storage will not be the limiting factor (IBM did test with 177 SSDs in the submitted result, they're welcome to increase that to 400) they would not be able to double the performance of the Power7 system.  And they'd need more than that to beat the SPARC T5-4 result.  This is even more challenging in the "per core" metric that IBM values so highly. 

For x86, the story isn't any better.  Unfortunately, Intel doesn't have such handy rPerf charts, so I'll have to fall back to SPECint_rate2006 for this one.  (Note that I am not a big fan of using one benchmark to estimate another.  SPECcpu in particular is not very suitable for estimating database performance, as there is almost no IO involved.)  The above HP system is listed with 1580 CINT2006_rate.  The best result as of 2013-06-14 for the new Intel Xeon E7-4870 with 8 CPUs is 2180 CINT2006_rate.  That's an improvement of 1.38x.  (If we just take the increase in clockrate and core count, it would give us 1.32x.)  I'll stop here and let you do the math yourself - it's not very promising for x86...

Of course, IBM and others are welcome to prove me wrong - but as of today, I'm waiting for recent publications in this data range.

So what have we learned?

  • There's some evidence that storage might have been the limiting factor that prevented the SPARC T5-4 from scaling beyond 2x.
  • The myth that SPARC cores don't perform is just that - a myth.  Next time you meet one, ask your IBM sales rep when they'll publish TPC-H for Power7+.
  • Cache memory isn't the magic performance switch some people think it is.
  • Scaling a CPU architecture (and the OS on top of it) beyond a certain limit is hard.  It seems to be a little harder in the x86 world.

What did I miss?  Well, price/performance is something I'll let you discuss with your sales reps ;-)

And finally, before people ask - no, I haven't moved to marketing.  But sometimes I just can't resist...


Disclosure Statements

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

TPC-H, QphH, $/QphH are trademarks of Transaction Processing Performance Council (TPC). For more information, see www.tpc.org, results as of 6/7/13. Prices are in USD. SPARC T5-4 409,721.8 QphH@3000GB, $3.94/QphH@3000GB, available 9/24/13, 4 processors, 64 cores, 512 threads; SPARC T4-4 205,792.0 QphH@3000GB, $4.10/QphH@3000GB, available 5/31/12, 4 processors, 32 cores, 256 threads; IBM Power 780 192,001.1 QphH@3000GB, $6.37/QphH@3000GB, available 11/30/11, 8 processors, 32 cores, 128 threads; HP ProLiant DL980 G7 162,601.7 QphH@3000GB, $2.68/QphH@3000GB, available 10/13/10, 8 processors, 64 cores, 128 threads.

SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results as of June 18, 2013 from www.spec.org. HP ProLiant DL980 G7 (2.27 GHz, Intel Xeon X7560): 1580 SPECint_rate2006; HP ProLiant DL980 G7 (2.4 GHz, Intel Xeon E7-4870): 2180 SPECint_rate2006.

Thursday, April 4, 2013

A few remarks about T5

By now, most of you will have seen the announcement of the T5 and M5 systems.  I don't intend to repeat any of this, but I would like to share a few early thoughts.  Keep in mind, those thoughts are mine alone, not Oracle's.

It was rather obvious during the Launch Event that we will enjoy the competition with IBM even more than before.  I will not join the battle of words here, but leave you with a very nice summary (of the first few skirmishes) found on Forbes.  It is worth 2 minutes of reading - I find it very interesting how IBM seems to be losing interest in performance...

Since much of the attention we are getting is based on performance claims, I thought it would be nice to have a short and clearly arranged overview of the more commonly used benchmark results that were posted.  I will not compare the results to any other systems here, but leave this as a very entertaining exercise to you ;-)

There are more performance publications, especially on the BestPerf blog.  Some of these are interesting because they compare T5 to x86 CPUs, something I recommend doing if you don't shy away from reconsidering your view of the world from time to time.  But the ones I listed here are more likely to be accepted as "independent" benchmarks than some others.  Now, we all know that benchmarking is a leap-frogging game, I wonder who will jump next?  (We've leap-frogged our own systems a couple times, too...)    And to finish this entry off, I'd like to remind you that performance is only one part of the equation.  What usually matters just as much, if not more, is price performance.  In the benchmarking game, we can usually only compare list prices - have a go at that!  To quote Larry here:  “You can go faster, but only if you are willing to pay 80% less than what IBM charges.”

Competition is an interesting thing, don't you think?

Monday, January 14, 2013

LDoms IO Best Practices & T4 Red Crypto Stack

In November, I presented at DOAG Konferenz & Ausstellung 2012.  Now, almost two months later, I finally get around to posting the slides here...

  • In "LDoms IO Best Practices" I discuss different IO options for both disk and networking and give some recommens on how you to choose the right ones for your environment.  A couple hints about performance are also included.

I hope the slides are useful!

Friday, December 21, 2012

What's up with LDoms: Part 6 - Sizing the IO Domain

Before Christmas break, let's look at a topic that's one of the more frequently asked questions: Sizing of the Control Domain and IO Domain.

By now, we've seen how to create the basic setup, create a simple domain and configure networking and disk IO.  We know that for typical virtual IO, we use vswitches and virtual disk services to provide virtual network and disk services to the guests.  The question to address here is: How much CPU and memory is required in the Control and IO-domain (or in any additional IO domain) to provide these services without being a bottleneck?

The answer to this question can be very quick: LDoms Engineering usually recommends 1 or 2 cores for the Control Domain.

However, as always, one size doesn't fit all, and I'd like to look a little closer. 

Essentially, this is a sizing question just like any other system sizing.  So the first question to ask is: What services is the Control Domain providing that need CPU or memory resources?  We can then continue to estimate or measure exactly how much of each we will need. 

As for the services, the answer is straightforward:

  • The Control Domain usually provides
    • Console Services using vntsd
    • Dynamic Reconfiguration and other infrastructure services
    • Live Migration
  • Any IO Domain (either the Control Domain or an additional IO domain) provides
    • Disk Services configured through the vds
    • Network Services configured through the vswitch

For sizing, it is safe to assume that vntsd, ldmd (the actual LDoms Manager daemon), ldmad (the LDoms agent) and any other infrastructure tasks will require very little CPU and can be ignored.  Let's look at the remaining three services:

  • Disk Services
    Disk Services have two parts:  data transfer from the IO domain to the backend devices, and data transfer from the IO domain to the guest.  Disk IO in the IO domain is relatively cheap; you don't need many CPU cycles to deal with it.  I have found 1-2 threads of a T2 CPU to be sufficient for about 15,000 IOPS.  Today we usually use T4...
    However, this also depends on the type of backend storage you use.  FC or SAS rawdevice LUNs will have very little CPU overhead.  OTOH, if you use files hosted on NFS or ZFS, you are likely to see more CPU activity involved.  Here, your mileage will vary, depending on the configuration and usage pattern.  Also keep in mind that backends hosted on NFS or iSCSI also involve network traffic.
  • Network Services - vswitches
    There is a very old sizing rule that says that you need 1 GHz worth of CPU to saturate 1GBit worth of ethernet.  SAE has published a network encryption benchmark where a single T4 CPU at 2.85 GHz will transmit around 9 GBit at 20% utilization.  Converted into strands and cores, that would mean about 13 strands - less than 2 cores for 9GBit worth of traffic.  Encrypted, mind you.  Applying the mentioned old rule to this result, we would need just over 3 cores at 2.85 GHz to do 9 GBit - it seems we've made some progress in efficiency ;-)
    Applying all of this to IO Domain sizing, I would consider 2 cores an upper bound for typical installations, where you might very well get along with just one core, especially on smaller systems like the T4-1, where you're not likely to have several guest systems that each require  10GBit wirespeed networking.
  • Live Migration
    When considering Live Migration, we should understand that the Control Domains of the two involved systems are the ones actually doing all the work.  They encrypt, compress and send the source system's memory to the target system.  For this, they need quite a bit of CPU.  Of course, one could argue that Live Migration is something happening in the background, so it doesn't matter how fast it's actually done.  However, there's still the suspend phase, where the guest system is suspended and the remaining dirty memory pages are copied over to the other side.  This phase, while typically very short, significantly impacts the "live" experience of Live Migration.  And while other factors like guest activity level and memory size also play a role, there's also a direct connection between CPU power and the length of this suspend time.  The relation between Control Domain CPU configuration and suspend time has been studied and published in the whitepaper "Increasing Application Availability Using Oracle VM Server for SPARC (LDoms) An Oracle Database Example".  The conclusion: For minimum suspend times, configure 3 cores in the Control Domain.  I personally have had good experiences with 2 cores, measuring suspend times as low as 0.1 seconds with a very idle domain, so again, your mileage will vary.

    Another thought here:  The Control Domain doesn't usually do Live Migration on a permanent basis.  So if a single core is sufficient for the IO Domain role of the Control Domain, you are in good shape for everyday business with just one core.  When you need additional CPU for a quick Live Migration, why not borrow it from somewhere else, like the domain being migrated, or any other domain not currently very busy?  CPU DR lends itself nicely to this purpose - see the sketch below.
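Such a temporary loan could look like this (a sketch - the domain and host names are made up):

root@sun # ldm set-core 3 primary                # borrow two extra cores
root@sun # ldm migrate-domain mars target-host   # migrate with the extra horsepower
root@sun # ldm set-core 1 primary                # and give them back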

As you've seen, there are some rules, there is some experience, but still, there isn't one single answer.  In many cases, you should be ok with a single core on T4 for each IO domain.  If you use Live Migration a lot, you might want to add another core to the Control Domain.  On larger systems with higher networking demands, two cores for each IO Domain might be right.  If these recommendations are good enough for you, you're done.  If you want to dig deeper, simply check what's really going on in your IO Domains.  Use mpstat(1M) to study the utilization of your IO Domain's CPUs in times of high activity.  Perhaps record CPU utilization over a period of time, using your tool of choice.  (I recommend DimSTAT for that.)  With these results, you should be able to adjust the amount of CPU resources of your IO Domains to your exact needs.  However, when doing that, please remember those unannounced utilization peaks - don't be too stingy.  Saving one or two CPU strands won't buy you much, all things considered.

A few words about memory:  This is much more straightforward.  If you're not using ZFS as a backing store for your virtual disks, you should be well in the green with 2-4GB of RAM.  My current test system, running Solaris 11.0 in the Control Domain, needs less than 600 MB of virtual memory.  Remember that 1GB is the supported minimum for Solaris 11 (and it's changed to 1.5 GB for Solaris 11.1).  If you do use ZFS, you might want to reserve a couple GB for its ARC, so perhaps 8 GB are more appropriate.  On the Control Domain, which is the first domain to be bound, take 7680MB, which adds up to 8GB together with the hypervisor's own 512MB, nicely fitting the 8GB boundary favoured by the memory controllers.  Again, if you want to be precise, monitor memory usage in your IO domains.
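Following the 8GB boundary advice, the corresponding configuration would be (a sketch - setting memory on a running control domain requires a delayed reconfiguration and a reboot):

root@sun # ldm start-reconf primary
root@sun # ldm set-memory 7680m primary
root@sun # reboot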


Update: I just learned that the hypervisor doesn't always take exactly 512MB. So if you do want to align with the 8GB boundary, check the sizes using "ldm ls-devices -a mem". Everything bound to "sys" is owned by the hypervisor.

Wednesday, November 7, 2012

What's up with LDoms: Part 5 - A few Words about Consoles

Back again to look at a detail of LDom configuration that is often forgotten - the virtual console server.

Remember, LDoms are SPARC systems.  As such, each guest will have its own OBP running.  And to connect to that OBP, the administrator will need a console connection.  Since it's OBP, and not some x86 BIOS, this console will be very serial in nature ;-)  It's really very much like in the good old days, where we had a terminal concentrator that all those serial cables ended up in.  Just like with other components in LDoms, the virtualized solution looks very similar.

Every LDom guest requires exactly one console connection.  Envision this as similar to the RS-232 port on older SPARC systems.  The LDom framework provides one or more console services that provide access to these connections.  This would be the virtual equivalent of a network terminal server (NTS), where all those serial cables are plugged in.  In the physical world, we'd have a list somewhere that would tell us which TCP port of the NTS was connected to which server.  "ldm list" does just that:

root@sun # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  UART    16    7680M    0.4%  27d 8h 22m
jupiter          bound      ------  5002    20    8G             
mars             active     -n----  5000    2     8G       0.5%  55d 14h 10m
venus            active     -n----  5001    2     8G       0.5%  56d 40m
pluto            inactive   ------          4     4G             

The column marked "CONS" tells us where to reach the console of each domain. In the case of the primary domain, this is actually a (more) physical connection - it's the console connection of the physical system, which is either reachable via the ILOM of that system, or directly via the serial console port on the chassis. All the other guests are reachable through the console service which we created during the initial setup of the system.  Note that pluto does not have a port assigned.  This is because pluto is not yet bound.  (Binding can be viewed very much as the assembly of computer parts - CPU, memory, disks, network adapters and a serial console cable are all put together when binding the domain.)  Unless we set the port number explicitly, LDoms Manager will do this on a first come, first served basis.  For just a few domains, this is fine.  For larger deployments, it might be a good idea to assign these port numbers manually using the "ldm set-vcons" command.  However, there is even better magic associated with virtual consoles.

You can group several domains into one console group, reachable through one TCP port of the console service.  This can be useful when several groups of administrators are to be given access to different domains, or for other grouping reasons.  Here's an example:

root@sun # ldm set-vcons group=planets service=console jupiter
root@sun # ldm set-vcons group=planets service=console pluto
root@sun # ldm bind jupiter 
root@sun # ldm bind pluto
root@sun # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  UART    16    7680M    6.1%  27d 8h 24m
jupiter          bound      ------  5002    20    8G             
mars             active     -n----  5000    2     8G       0.6%  55d 14h 12m
pluto            bound      ------  5002    4     4G             
venus            active     -n----  5001    2     8G       0.5%  56d 42m

root@sun # telnet localhost 5002
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

sun-vnts-planets: h, l, c{id}, n{name}, q:l
DOMAIN ID           DOMAIN NAME                   DOMAIN STATE             
2                   jupiter                       online                   
3                   pluto                         online                   

sun-vnts-planets: h, l, c{id}, n{name}, q:npluto
Connecting to console "pluto" in group "planets" ....
Press ~? for control options ..

What I did here was add the two domains pluto and jupiter to a new console group called "planets" on the service "console" running in the primary domain.  Simply using a group name will create such a group, if it doesn't already exist.  By default, each domain has its own group, using the domain name as the group name.  The group will be available on port 5002, chosen by LDoms Manager because I didn't specify it.  If I connect to that console group, I will now first be prompted to choose the domain I want to connect to from a little menu.

Finally, here's an example how to assign port numbers explicitly:

root@sun # ldm set-vcons port=5044 group=pluto service=console pluto
root@sun # ldm bind pluto
root@sun # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  UART    16    7680M    3.8%  27d 8h 54m
jupiter          active     -t----  5002    20    8G       0.5%  30m
mars             active     -n----  5000    2     8G       0.6%  55d 14h 43m
pluto            bound      ------  5044    4     4G             
venus            active     -n----  5001    2     8G       0.4%  56d 1h 13m

With this, pluto would always be reachable on port 5044 in its own exclusive console group, no matter in which order other domains are bound.

Now, you might be wondering why we always have to mention the console service name, "console", in all the examples here.  The simple answer is because there could be more than one such console service.  For all "normal" use, a single console service is absolutely sufficient.  But the system is flexible enough to allow more than that single one, should you need them.  In fact, you could even configure such a console service on a domain other than the primary (or control domain), which would make that domain a real console server.  I actually have a customer who does just that - they want to separate console access from the control domain functionality.  But this is definitely a rather sophisticated setup.
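For the record, creating such an additional console service is a one-liner - a sketch, assuming a domain called "console-server" and a free port range:

root@sun # ldm add-vcc port-range=5100-5199 console2 console-server

You'd then enable the vntsd SMF service (svc:/ldoms/vntsd) in that domain and point your guests' consoles at the new service with "ldm set-vcons".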

Something I don't want to go into in this post is access control.  vntsd, which is the daemon providing all these console services, is fully RBAC-aware, and you can configure authorizations for individual users to connect to console groups or individual domain's consoles.  If you can't wait until I get around to security, check out the man page of vntsd.

Further reading:

  • The Admin Guide is rather reserved on this subject.  I do recommend checking out the Reference Manual.
  • The manpage for vntsd will discuss all the control sequences as well as the grouping and authorizations mentioned here.

Monday, October 1, 2012

What's up with LDoms: Part 4 - Virtual Networking Explained

I'm back from my summer break (and some pressing business that kept me away from this), ready to continue with Oracle VM Server for SPARC ;-)

In this article, we'll have a closer look at virtual networking.  Basic connectivity as we've seen it in the first, simple example, is easy enough.  But there are numerous options for the virtual switches and virtual network ports, which we will discuss in more detail now.   In this section, we will concentrate on virtual networking - the capabilities of virtual switches and virtual network ports - only.  Other options involving hardware assignment or redundancy will be covered in separate sections later on.

There are two basic components involved in virtual networking for LDoms: Virtual switches and virtual network devices.  The virtual switch should be seen just like a real ethernet switch.  It "runs" in the service domain and moves ethernet packets back and forth.  A virtual network device is plumbed in the guest domain.  It corresponds to a physical network device in the real world.  There, you'd be plugging a cable into the network port, and plug the other end of that cable into a switch.  In the virtual world, you do the same:  You create a virtual network device for your guest and connect it to a virtual switch in a service domain.  The result works just like in the physical world, the network device sends and receives ethernet packets, and the switch does all those things ethernet switches tend to do.

If you look at the reference manual of Oracle VM Server for SPARC, there are numerous options for virtual switches and network devices.  Don't be confused - it's rather straightforward, really.  Let's start with the simple case, and work our way to some more sophisticated options later on.

In many cases, you'll want to have several guests that communicate with the outside world on the same ethernet segment.  In the real world, you'd connect each of these systems to the same ethernet switch.  So, let's do the same thing in the virtual world:

root@sun # ldm add-vsw net-dev=nxge2 admin-vsw primary
root@sun # ldm add-vnet admin-net admin-vsw mars
root@sun # ldm add-vnet admin-net admin-vsw venus

We've just created a virtual switch called "admin-vsw" and connected it to the physical device nxge2.  In the physical world, we'd have powered up our ethernet switch and installed a cable between it and our big enterprise datacenter switch.  We then created a virtual network interface for each one of the two guest systems "mars" and "venus" and connected both to that virtual switch.  They can now communicate with each other and with any system reachable via nxge2.  If primary were running Solaris 10, communication between the primary domain and the guests would not be possible.  This is different with Solaris 11 - please see the Admin Guide for details.  Note that I've given both the vswitch and the vnet devices some sensible names, something I always recommend.

Unless told otherwise, the LDoms Manager software will automatically assign MAC addresses to all network elements that need one.  It will also make sure that these MAC addresses are unique and reuse MAC addresses to play nice with all those friendly DHCP servers out there.  However, if we want to do this manually, we can also do that.  (One reason might be firewall rules that work on MAC addresses.)  So let's give mars a manually assigned MAC address:

root@sun # ldm set-vnet mac-addr=0:14:4f:f9:c4:13 admin-net mars

Within the guest, these virtual network devices have their own device driver.  In Solaris 10, they'd appear as "vnet0".  Solaris 11 would apply its usual vanity naming scheme.  We can configure these interfaces just like any normal interface, give them an IP address and configure sophisticated routing rules, just like on bare metal.

In many cases, using Jumbo Frames helps increase throughput performance.  By default, these interfaces will run with the standard ethernet MTU of 1500 bytes.  To change this,  it is usually sufficient to set the desired MTU for the virtual switch.  This will automatically set the same MTU for all vnet devices attached to that switch.  Let's change the MTU size of our admin-vsw from the example above:

root@sun # ldm set-vsw mtu=9000 admin-vsw primary

Note that you can set the MTU to any value between 1500 and 16000.  Of course, whatever you set needs to be supported by the physical network, too.

Another very common area of network configuration is VLAN tagging.  This can be a little confusing - my advice here is to be very clear on what you want, and perhaps draw a little diagram the first few times.  As always, keeping a configuration simple will help avoid errors of all kinds.  Nevertheless, VLAN tagging is very useful for consolidating different networks onto one physical cable.  And as such, this concept needs to be carried over into the virtual world.  Enough of the introduction, here's a little diagram to help in explaining how VLANs work in LDoms:
[Diagram: VLANs in LDoms]
Let's remember that any VLANs not explicitly tagged have the default VLAN ID of 1.  In this example, we have a vswitch connected to a physical network that carries untagged traffic (VLAN ID 1) as well as VLANs 11, 22, 33 and 44.  There might also be other VLANs on the wire, but the vswitch will ignore all those packets.  We also have two vnet devices, one for mars and one for venus.  Venus will see traffic from VLANs 33 and 44 only.  For VLAN 44, venus will need to configure a tagged interface "vnet44000".  For VLAN 33, the vswitch will untag all incoming traffic for venus, so that venus will see this as "normal" or untagged ethernet traffic.  This is very useful for simplifying guest configuration, and also allows venus to perform Jumpstart or AI installations over this network even if the Jumpstart or AI server is connected via VLAN 33.  Mars, on the other hand, has full access to untagged traffic from the outside world, and also to VLANs 11, 22 and 33, but not 44.  On the command line, we'd do this like this:

root@sun # ldm add-vsw net-dev=nxge2 pvid=1 vid=11,22,33,44 admin-vsw primary
root@sun # ldm add-vnet admin-net pvid=1 vid=11,22,33 admin-vsw mars
root@sun # ldm add-vnet admin-net pvid=33 vid=44 admin-vsw venus

Finally, I'd like to point to a neat little option that will make your life easier in all those cases where configurations tend to change over the life of a guest system.  It's the "id=<somenumber>" option available for both vswitches and vnet devices.  Normally, Solaris in the guest would enumerate network devices sequentially.  However, it has ways of remembering this initial numbering.  This is good in the physical world.  In the virtual world, whenever you unbind (aka power off and disassemble) a guest system, remove and/or add network devices and bind the system again, chances are this numbering will change.  Configuration confusion will follow suit.  To avoid this, nail down the initial numbering by assigning each vnet device its device-id explicitly:

root@sun # ldm add-vnet admin-net id=1 admin-vsw venus

Please consult the Admin Guide for details on this, and how to decipher these network ids from Solaris running in the guest.

Thanks for reading this far.  Links for further reading are essentially only the Admin Guide and Reference Manual and can be found above.  I hope this is useful and, as always, I welcome any comments.


Monday, September 10, 2012

Secure Deployment of Oracle VM Server for SPARC - updated

Quite a while ago, I published a paper with recommendations for a secure deployment of LDoms.  Many things have happened in the meantime, and an update to that paper was due.  Besides some minor spelling corrections, many obsolete or changed links were updated.  However, the main reason for the update was the introduction of a second usage model for LDoms.  In a very short few words: With the success especially of the T4-4, many deployments make use of the hardware partitioning capabilities of that platform, assigning full PCIe root complexes to domains, mimicking dynamic system domains if you will.  This different way of using the hypervisor needed to be addressed in the paper.  You can find the updated version here:

Secure Deployment of Oracle VM Server for SPARC
Second Edition

I hope it'll be useful!

Friday, July 13, 2012

What's up with LDoms: Part 3 - A closer look at Disk Backend Choices

In this section, we'll have a closer look at virtual disk backends and the various choices available here.  As a little reminder, a disk backend, in LDoms speak, is the physical storage used when creating a virtual disk for a guest system.  In other virtualization solutions, these are sometimes called virtual disk images, a term that doesn't really fit all the options available in LDoms.

In the previous example, we used a ZFS volume as a backend for the boot disk of mars.  But there are many other ways to store the data of virtual disks.  The relevant section in the Admin Guide lists all the available options:

  • Physical LUNs, in any variant that the Control Domain supports.  This of course includes SAN, iSCSI and SAS, including the internal disks of the host system.
  • Logical Volumes like ZFS Volumes, but also SVM or VxVM
  • Regular Files. These can be stored in any filesystem, as long as they're accessible by the LDoms subsystem. This includes storage on NFS.

Each of these backend devices has its own set of characteristics that should be considered when deciding which backend type to use.  Let's look at them in a little more detail.

LUNs are the most generic option.  By assigning a virtual disk to a LUN backend, the guest essentially gains full access to the underlying storage device, whatever that might be.  It will see the volume label of the LUN, it can see and alter the partition table of the LUN, it can also read or set SCSI reservations on that device.  Depending on the way the LUN is connected to the host system, this very same LUN could also be attached to a second host and a guest residing on it, with the two guests sharing the data on that one LUN, or supporting live migration.  If there is a filesystem on the LUN, the guest will be able to mount that filesystem, just like any other system with access to that LUN, be it virtualized or direct.  Bear in mind that most filesystems are non-shared filesystems.  This doesn't change here, either.  For the IO domain (that's the domain where the physical LUN is connected), LUNs mean the least possible amount of work.  All it has to do is pass data blocks up and down to and from the LUN; there is a bare minimum of driver layers involved.
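Exporting a LUN as a virtual disk looks like this (a sketch - the device path and the vds name "primary-vds0" are placeholders):

root@sun # ldm add-vdsdev /dev/rdsk/c0t5000CCA0156A8E55d0s2 mars-lun0@primary-vds0
root@sun # ldm add-vdisk disk0 mars-lun0@primary-vds0 mars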

Flat files, on the other hand, are the simplest option, very similar in user experience to what one would do in a desktop hypervisor like VirtualBox.  The easiest way to create one is with the "mkfile" command.  For the guest, there is no real difference to LUNs.  The virtual disk will, just like in the LUN case, appear to be a full disk, partition table, label and all.  Of course, initially, it'll be all empty, so the first thing the guest usually needs to do is write a label to the disk.  The main difference to LUNs is in the way these image files are managed.  Since they are files in a filesystem, they can be copied, moved and deleted, all of which should be done with care, especially if the guest is still running.  They can be managed by the filesystem, which means attributes like compression, encryption or deduplication in ZFS could apply to them - fully transparent to the guest.  If the filesystem is a shared filesystem like NFS or SAM-FS, the file (and thus the disk image) could be shared by another LDom on another system, for example as a shared database disk or for live migration.  Their performance will be impacted by the filesystem, too.  The IO domain might cache some of the file, hoping to speed up operations.  If there are many such image files on a single filesystem, they might impact each other's performance.  These files, by the way, need not be empty initially.  A typical use case would be a Solaris iso image file.  Adding it to a guest as a virtual disk will allow that guest to boot (and install) off that iso image as if it were a physical CD drive.
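A sketch of the file case, again with made-up paths and names:

root@sun # mkfile 20g /ldoms/venus-disk0.img
root@sun # ldm add-vdsdev /ldoms/venus-disk0.img venus-disk0@primary-vds0
root@sun # ldm add-vdisk disk0 venus-disk0@primary-vds0 venus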

Finally, there are logical volumes, typically created with volume managers such as Solaris Volume Manager (SVM), Veritas Volume Manager (VxVM) or ZFS, of course.  For the guest, again, these look just like ordinary disks, very much like files.  The difference to files is in the management layer: the logical volumes are created straight from the underlying storage, without a filesystem layer in between.  In the database world, we would call these "raw devices", and their device names in Solaris are very similar to those of physical LUNs.  We need different commands to find out how large these volumes are, or how much space is left on the storage devices underneath.  Other than that, however, they are very similar to files in many ways.  Sharing them between two host systems is likely to be more complex, as one would need the corresponding cluster volume managers, which typically only really work in combination with Solaris Cluster.  One type of volume that deserves special mention is the ZFS volume.  It offers all the features of a normal ZFS dataset: clones, snapshots, compression, encryption, deduplication, etc.  Especially with snapshots and clones, they lend themselves as the ideal backend for all use cases that make heavy use of these features.
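Creating and exporting a ZFS volume is just as simple (a sketch, assuming a zpool named rpool and the usual placeholder names):

root@sun # zfs create -V 20g rpool/ldoms/mars-data
root@sun # ldm add-vdsdev /dev/zvol/rdsk/rpool/ldoms/mars-data mars-data@primary-vds0
root@sun # ldm add-vdisk data mars-data@primary-vds0 mars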

For the sake of completeness, I'd like to mention that you can export all of these backends to a guest with or without the "slice" option, something that I consider less useful in most cases, which is why I'd like to refer you to the relevant section in the Admin Guide if you want to know more about this.

Lastly, you do have the option to export these backends read-only to prevent any changes from the guests.  Keep in mind that even mounting a UFS filesystem read-only would require a write operation to the virtual disk.  The most typical use case for this is probably an iso image, which can indeed be mounted read-only.  You can also export one backend to more than one guest.  In the physical world, this would correspond to using the same SAN LUN on several hosts, and the same restrictions with regard to shared filesystems etc. apply.
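For example, a DVD iso image could be exported read-only to two guests like this (a sketch; note the -f flag needed to export the same backend a second time):

root@sun # ldm add-vdsdev options=ro /export/isos/sol-11.iso iso-mars@primary-vds0
root@sun # ldm add-vdsdev -f options=ro /export/isos/sol-11.iso iso-venus@primary-vds0
root@sun # ldm add-vdisk iso iso-mars@primary-vds0 mars
root@sun # ldm add-vdisk iso iso-venus@primary-vds0 venus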

So now that we know about all these different options, when should we use which kind of backend?  The answer, as usual, is: It depends!

LUNs require a SAN (or iSCSI) infrastructure, which we tend to associate with higher cost.  On the other hand, they can be shared between many hosts, are easily mapped from host to host and bring a rich feature set of storage management and redundancy with them.  I recommend LUNs (especially SAN) for both boot devices and data disks of guest systems in production environments.  My main reasons for this are listed below, followed by a short export example:

  • They are very lightweight on the IO domain.
  • They avoid any double buffering of data in the guest and in the IO domain, because there is no filesystem layer involved in the IO domain.
  • Redundancy for the device and the data path is easy.
  • They allow sharing between hosts, which in turn allows cluster implementations and live migration.
  • All ZFS features can be implemented in the guest, if desired.
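
Here's the promised export example for a SAN LUN - a minimal sketch, with a made-up device path (s2 traditionally represents the whole disk):

root@sun # ldm add-vdsdev /dev/dsk/c0t60060E801046B960d0s2 \
           mars.lun0@primary-vds
root@sun # ldm add-vdisk lun0 mars.lun0@primary-vds mars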

For test and development, my first choice is usually the ZFS volume.  Unlike VxVM, it comes free of charge, and its features like snapshots and clones meet the typical requirement of such environments to quickly create, copy and destroy test systems.  I explicitly recommend against using ZFS snapshots/clones (files or volumes) over a longer period of time.  Since ZFS records the delta between the original image and the clones, the space overhead will eventually grow to a multiple of the initial size and may eventually even prevent further IO to the virtual disk if the zpool is full.  Also keep in mind that ZFS is not a shared filesystem.  This prevents guests that use ZFS files or volumes as virtual disks from doing live migration.  Which leads directly to the recommendation for files:

I recommend files on NFS (or other shared filesystems) in all those cases where SAN LUNs are not available but shared access to disk images is required, either for live migration or because cluster software like Solaris Cluster or RAC is running in the guests.  The functionality is mostly the same as for LUNs, with the exception of SCSI reservations, which don't work with a file backend.  However, CPU requirements in the IO domain and the performance of NFS files are likely to differ from SAN LUNs, which is why I strongly recommend SAN LUNs for all production use cases.

Friday Jun 29, 2012

What's up with LDoms: Part 2 - Creating a first, simple guest

Welcome back!

In the first part, we discussed the basic concepts of LDoms and how to configure a simple control domain.  We saw how resources were put aside for guest systems and what infrastructure we need for them.  With that, we are now ready to create a first, very simple guest domain.  In this first example, we'll keep things very simple.  Later on, we'll have a detailed look at things like sizing, IO redundancy, other types of IO as well as security.

For now, let's start with this very simple guest.  It'll have one core's worth of CPU, one crypto unit, 8GB of RAM, a single boot disk and one network port.  (If this were a T4 system, we wouldn't have to assign the crypto units.  Since this is a T3, it makes lots of sense to do so.)  CPU and RAM are easy.  The network port we'll create by attaching a virtual network port to the vswitch we created in the primary domain.  This is very much like plugging a cable into a computer system on one end and a network switch on the other.  For the boot disk, we'll need two things: a physical piece of storage to hold the data - this is called the backend device in LDoms speak - and a mapping between that storage and the guest domain, giving it access to that virtual disk.  For this example, we'll use a ZFS volume for the backend.  We'll discuss what other options there are for this and how to choose the right one in a later article.  Here we go:

root@sun # ldm create mars

root@sun # ldm set-vcpu 8 mars 
root@sun # ldm set-mau 1 mars 
root@sun # ldm set-memory 8g mars

root@sun # zfs create rpool/guests
root@sun # zfs create -V 32g rpool/guests/mars.bootdisk
root@sun # ldm add-vdsdev /dev/zvol/dsk/rpool/guests/mars.bootdisk \
           mars.root@primary-vds
root@sun # ldm add-vdisk root mars.root@primary-vds mars

root@sun # ldm add-vnet net0 switch-primary mars

That's all, mars is now ready to power on.  There are just three commands between us and the OK prompt of mars:  We have to "bind" the domain, start it and connect to its console.  Binding is the process where the hypervisor actually puts all the pieces that we've configured together.  If we made a mistake, binding is where we'll be told (starting in version 2.1, a lot of sanity checking has been put into the config commands themselves, but binding will catch everything else).  Once bound, we can start (and of course later stop) the domain, which will trigger the boot process of OBP.  By default, the domain will then try to boot right away.  If we don't want that, we can set "auto-boot?" to false.  Finally, we'll use telnet to connect to the console of our newly created guest.  The output of "ldm list" shows us what port has been assigned to mars.  By default, the console service only listens on the loopback interface, so using telnet is not a large security concern here.

root@sun # ldm set-variable auto-boot\?=false mars
root@sun # ldm bind mars
root@sun # ldm start mars 

root@sun # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  UART    8     7680M    0.5%  1d 4h 30m
mars             active     -t----  5000    8     8G       12%   1s

root@sun # telnet localhost 5000

Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

~Connecting to console "mars" in group "mars" ....
Press ~? for control options ..

{0} ok banner

SPARC T3-4, No Keyboard
Copyright (c) 1998, 2011, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.33.1, 8192 MB memory available, Serial # 87203131.
Ethernet address 0:21:28:24:1b:50, Host ID: 85241b50.

{0} ok 

We're done, mars is ready to install Solaris, preferably using AI, of course ;-)  But before we do that, let's have a little look at the OBP environment to see how our virtual devices show up here:

{0} ok printenv auto-boot?
auto-boot? =            false

{0} ok printenv boot-device
boot-device =           disk net

{0} ok devalias
root                     /virtual-devices@100/channel-devices@200/disk@0
net0                     /virtual-devices@100/channel-devices@200/network@0
net                      /virtual-devices@100/channel-devices@200/network@0
disk                     /virtual-devices@100/channel-devices@200/disk@0
virtual-console          /virtual-devices/console@1
name                     aliases

We can see that setting the OBP variable "auto-boot?" to false with the ldm command worked.  Of course, we'd normally set this to "true" to allow Solaris to boot right away once the LDom guest is started.  The setting for "boot-device" is the default "disk net", which means OBP would try to boot off the devices pointed to by the aliases "disk" and "net" in that order, which usually means "disk" once Solaris is installed on the disk image.  The actual devices these aliases point to are shown with the command "devalias".  Here, we have one line for both "disk" and "net".  The device paths speak for themselves.  Note that each of these devices has a second alias: "net0" for the network device and "root" for the disk device.  These are the very same names we've given these devices in the control domain with the commands "ldm add-vnet" and "ldm add-vdisk".  Remember this, as it is very useful once you have several dozen disk devices...
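
If, for example, we wanted OBP to boot from our explicitly named disk alias by default, we could set that right here (a hypothetical example):

{0} ok setenv boot-device root
boot-device =           root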

To wrap this up, in this part we've created a simple guest domain, complete with CPU, memory, boot disk and network connectivity.  This should be enough to get you going.  I will cover all the more advanced features and a little more theoretical background in several follow-on articles.  For some background reading, I'd recommend the following links:

What's up with LDoms: Part 1 - Introduction & Basic Concepts

LDoms - the correct name is Oracle VM Server for SPARC - have been around for quite a while now.  But to my surprise, I get more and more requests to explain how they work or to give advice on how to make good use of them.  This made me think that writing up a few articles discussing the different features would be a good idea.  Now - I don't intend to rewrite the LDoms Admin Guide or to copy and reformat the (hopefully) well known "Beginners Guide to LDoms" by Tony Shoumack from 2007.  Both documents are highly recommended - especially the Beginners Guide which, although based on LDoms 1.0, is still a good place to start.  However, LDoms have come a long way since then, and I hope to contribute to their adoption by discussing how they work and what features there are today.

In this and the following posts, I will use the term "LDoms" as a common abbreviation for Oracle VM Server for SPARC, just because it's a lot shorter and easier to type (and presumably, read).

So, just to get everyone on the same baseline, let's briefly discuss the basic concepts of virtualization with LDoms.  LDoms make use of a hypervisor as a layer of abstraction between real, physical hardware and virtual hardware.  This virtual hardware is then used to create a number of guest systems which each behave very similarly to a system running on bare metal: each has its own OBP, each will install its own copy of the Solaris OS and each will see a certain amount of CPU, memory, disk and network resources available to it.  Unlike some other type 1 hypervisors running on x86 hardware, the SPARC hypervisor is embedded in the system firmware and makes use of both supporting functions in the sun4v SPARC instruction set and the overall CPU architecture to fulfill its function.

The CMT architecture of the supporting CPUs (T1 through T4) provides a large number of cores and threads to the OS.  For example, the current T4 CPU has eight cores, each running eight threads, for a total of 64 threads per socket.  To the OS, this looks like 64 CPUs.
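
You can easily verify this from within Solaris.  Hypothetical output for a single-socket T4:

root@sun # psrinfo -p
1
root@sun # psrinfo | wc -l
      64

psrinfo prints one line per virtual CPU, so the one physical socket shows up as 64 CPUs.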

The SPARC hypervisor, when creating guest systems, simply assigns a certain number of these threads exclusively to one guest, thus avoiding the overhead of having to schedule OS threads to CPUs, as typical x86 hypervisors must.  The hypervisor only assigns CPUs and then steps aside.  It is not involved in the actual work being dispatched from the OS to the CPU; all it does is maintain isolation between different guests.

Likewise, memory is assigned exclusively to individual guests.  Here, the hypervisor provides generic mappings between the physical hardware addresses and the guest's view of memory.  Again, the hypervisor is not involved in the actual memory access; it only maintains isolation between guests.
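
You can actually look at these mappings from the control domain.  A hypothetical, abbreviated example for a guest with 8GB of RAM (the addresses are made up), where RA is the guest's view (real address) and PA the physical address behind it:

root@sun # ldm list -o memory mars
NAME
mars

MEMORY
    RA               PA               SIZE
    0x10000000       0x420000000      8G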

During the initial setup of a system with LDoms, you start with one special domain, called the Control Domain.  Initially, this domain owns all the hardware available in the system, including all CPUs, all RAM and all IO resources.  If you were running the system un-virtualized, this is what you'd be working with.  To allow for guests, you first resize this initial domain (also called the primary domain in LDoms speak), assigning it a small amount of CPU and memory.  This frees up most of the available CPU and memory resources for guest domains.

IO is a little more complex, but still straightforward.  When LDoms 1.0 first came out, the only way to provide IO to guest systems was to create virtual disk and network services and attach guests to these services.  In the meantime, several different ways to connect guest domains to IO have been developed, the most recent one being SR-IOV support for network devices, released in version 2.2 of Oracle VM Server for SPARC.  I will cover these more advanced features in detail later.  For now, let's have a short look at the initial way IO was virtualized in LDoms:

For virtualized IO, you create two services, one "Virtual Disk Service" or vds, and one "Virtual Switch" or vswitch.  You can, of course, also create more of these, but that's more advanced than I want to cover in this introduction.  These IO services then connect real, physical IO resources like a disk LUN or a network port to the virtual devices that are assigned to guest domains.  For disk IO, the normal case would be to connect a physical LUN (or some other storage option that I'll discuss later) to one specific guest.  That guest would be assigned a virtual disk, which appears to be just like a real LUN to the guest, while the IO is actually routed through the virtual disk service down to the physical device.  For network, the vswitch acts very much like a real, physical ethernet switch - you connect one physical port to it for outside connectivity and define one or more connections per guest, just like you would plug cables between a real switch and a real system.  For completeness, there is another service that provides console access to guest domains, mimicking the behavior of serial terminal servers.

The connections between the virtual devices on the guest's side and the virtual IO services in the primary domain are created by the hypervisor.  It uses so-called "Logical Domain Channels" or LDCs to create point-to-point connections between all of these devices and services.  These LDCs work very similarly to high speed serial connections and are configured automatically whenever the Control Domain adds or removes virtual IO.
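
If you're curious about these channels, "ldm list-bindings" shows, for every virtual device of a domain, the service and channel it is bound to.  The output is rather verbose, so I'll not reproduce it here, but the command itself is harmless to try:

root@sun # ldm list-bindings mars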

To see all this in action, let's now look at a first example.  I will start with a newly installed machine and configure the control domain so that it's ready to create guest systems.

In a first step, after we've installed the software, let's start the virtual console service and downsize the primary domain. 

root@sun # ldm list
NAME     STATE    FLAGS   CONS  VCPU  MEMORY   UTIL  UPTIME
primary  active   -n-c--  UART  512   261632M  0.3%  2d 13h 58m

root@sun # ldm add-vconscon port-range=5000-5100 \
               primary-console primary
root@sun # svcadm enable vntsd
root@sun # svcs vntsd
STATE          STIME    FMRI
online          9:53:21 svc:/ldoms/vntsd:default

root@sun # ldm set-vcpu 16 primary
root@sun # ldm set-mau 1 primary
root@sun # ldm start-reconf primary
root@sun # ldm set-memory 7680m primary
root@sun # ldm add-config initial
root@sun # shutdown -y -g0 -i6 

So what have I done:

  • I've defined a range of ports (5000-5100) for the virtual network terminal service and then started that service.  vntsd will later provide console connections to guest systems, very much like serial terminal servers do in the physical world.
  • Next, I assigned 16 vCPUs (on this platform, a T3-4, that's two cores) to the primary domain, freeing the rest up for future guest systems.  I also assigned one MAU to this domain.  A MAU is a crypto unit in the T3 CPU.  These need to be explicitly assigned to domains, just like CPU or memory.  (This is no longer the case with T4 systems, where crypto is always available everywhere.)
  • Before I reassigned the memory, I started what's called a "delayed reconfiguration" session.  That avoids actually doing the change right away, which would take a considerable amount of time in this case.  Instead, I'll need to reboot once I'm all done.  I've assigned 7680MB of RAM to the primary.  That's 8GB less the 512MB which the hypervisor uses for its own private purposes.  You can, depending on your needs, work with less.  I'll devote a dedicated article to sizing, discussing the pros and cons in detail.
  • Finally, just before the reboot, I saved my work on the ILOM, to make this configuration available after a power-cycle of the box.  (It'll always be available after a simple reboot, but the ILOM needs to know the configuration of the hypervisor after a power-cycle, before the primary domain is booted.)  A quick way to verify the saved configuration is shown below.
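
To verify the saved configuration (hypothetical output, which may vary slightly between versions; "initial" is the name chosen above):

root@sun # ldm list-config
factory-default
initial [current]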

Now, let's create a first disk service and a first virtual switch, the latter connected to the physical network device igb2.  We will later use these to connect virtual disks and virtual network ports of our guest systems to real world storage and network.

root@sun # ldm add-vds primary-vds primary
root@sun # ldm add-vswitch net-dev=igb2 switch-primary primary

You are free to choose whatever names you like for the virtual disk service and the virtual switch.  I strongly recommend that you choose names that make sense to you and describe the function of each service in the context of your implementation.  For the vswitch, for example, you could choose names like "admin-vswitch" or "production-network" etc.

This already concludes the configuration of the control domain.  We've freed up considerable amounts of CPU and RAM for guest systems and created the necessary infrastructure - console, vds and vswitch - so that guest systems can actually interact with the outside world.  The system is now ready to create guests, which I'll describe in the next section.
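
Before moving on, a quick sanity check doesn't hurt.  "ldm list-services" should show all three services we've created - here's an abbreviated, hypothetical version of its output:

root@sun # ldm list-services primary
VCC
    NAME             LDOM     PORT-RANGE
    primary-console  primary  5000-5100

VSW
    NAME             LDOM     NET-DEV   ...
    switch-primary   primary  igb2      ...

VDS
    NAME             LDOM     VOLUME    OPTIONS   DEVICE
    primary-vds      primary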

For further reading, here are some recommended links:

Oracle VM Server for SPARC Demo Videos

I just stumbled across several well done demos for newer LDoms features.  Find them all in the YouTube channel "Oracle VM Server for SPARC".  I'd like to recommend the ones about power management and cross-CPU migration specifically :-)