LDoms guest domains supported as Solaris Cluster nodes

Folks, when late last year we announced support for Solaris Cluster in LDoms I/O domains on this blog entry, we also hinted at support for LDoms guest domains. It has taken a bit longer than we envisaged, but I am pleased to report that SC Marketing has just announced support for LDoms guest domains with Solaris Cluster!

So, what exactly does "support" mean here? It means that you can create an LDoms guest domain running Solaris and then treat that guest domain as a cluster node by installing SC software (specific version and patch information is noted later in the blog) inside the guest domain, and have the SC software work with the virtual devices in the guest domain. The technically inclined reader would, at this point, have several questions pop into his head... How exactly does SC work with virtual devices? What do I have to do to make SC recognize these devices? Are there any differences between how SC is configured in LDoms guest domains vs. non-virtualized environments? Read on below for a high-level summary of the specifics:

  • For shared storage devices (i.e., those accessible from multiple cluster nodes), the virtual device must be backed by a full SCSI LUN. That means no file-backed virtual devices, no slices, and no volumes. This limitation exists because SC needs advanced features in the storage devices to guarantee data integrity, and those features are available only for virtual storage devices backed by full SCSI LUNs.

  • One may need to use storage which is unshared (i.e., accessed from only one cluster node), for things such as OS image installation for the guest domain. For such usage, any type of virtual device can be used, including those backed by files in the I/O domain. However, make sure to configure such virtual devices to be synchronous; check the LDoms documentation and release notes on how to do that. Currently (as of July 2008) one needs to add "set vds:vd_file_write_flags = 0" to the /etc/system file in the I/O domain exporting the file (see the sketch after this list). This is required because the Cluster stores some key configuration information on the root filesystem (in /etc/cluster) and expects that information to be written synchronously to disk. If the root filesystem of the guest domain lives on a file in the I/O domain, it needs this setting to be synchronous.

  • Network-based storage (NAS, etc.) is fine when used from within the guest domain. Check the cluster support matrix for specifics; LDoms guest domains don't change this support.

  • For the cluster private interconnect, the LDoms virtual device "vnet" can be used just fine; however, the virtual switch to which it maps must have the option "mode=sc" specified for it. So essentially, for the ldm subcommand add-vsw, you would add the argument "mode=sc" on the command line when creating the virtual switch that will be used for the cluster private interconnect inside the guest domains (see the sketch after this list). This option enables a fastpath in the I/O domain for the Cluster heartbeat packets so that those packets do not compete with application network packets in the I/O domain for resources. This greatly improves the reliability of the Cluster heartbeats, even under heavy load, leading to a very stable cluster membership for applications to work with. Note, however, that good engineering practices should still be followed when sizing your server resources (both in the I/O domain and in the guest domains) for the application load expected on the system.

  • With this announcement, all features of Solaris Cluster supported in non-virtualized environments are supported in LDoms guest domains, unless explicitly noted in the SC release notes. Some limitations come from LDoms themselves, such as the lack of jumbo frame support over virtual networks or the lack of link-based failure detection with IPMP in guest domains. Check the LDoms documentation and release notes for such limitations, as support for these missing features is improving all the time.

  • For support of specific applications with LDoms guest domains and SC, check with your ISV. Support for applications in LDoms guest domains is improving all the time, so check often.

  • Software version requirements: LDoms 1.0.3 or higher, and S10U5 with patches 137111-01, 137042-01, 138042-02, and 138056-01 or higher, are required in BOTH the LDoms guest domains and the I/O domains exporting virtual devices to the guest domains. Solaris Cluster 3.2 2/08 (SC32U1) with patch 126106-15 or higher is required in the LDoms guest domains.

  • Licensing for SC in LDoms guest domains follows the same model as that for I/O domains. You basically pay for the physical server, irrespective of how many guest domains and I/O domains are deployed on that physical server.
  • That covers the high-level overview of how SC is deployed inside LDoms guest domains. Check out the SC release notes for additional details and some sample configurations. The whole virtualization space is evolving very rapidly, and new developments are happening all the time. Keep this blog page bookmarked and visit it frequently to find out how Solaris Cluster is evolving along with this space.
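To make the bullets above concrete, here is a minimal sketch of the control-domain commands involved. All names and paths (primary-vds0, priv-vsw0, guest1, the disk path, e1000g1) are hypothetical; check the ldm(1M) man page and the LDoms Administration Guide before adapting them to your configuration.

        # Export a full SCSI LUN (not a file, slice, or volume) as the shared
        # storage that the guest-domain cluster node will see.
        ldm add-vds primary-vds0 primary
        ldm add-vdsdev /dev/dsk/c2t1d0s2 sharedlun1@primary-vds0
        ldm add-vdisk sharedlun1 sharedlun1@primary-vds0 guest1

        # Create the virtual switch for the cluster private interconnect with
        # mode=sc, then plumb a vnet from it into the guest domain.
        ldm add-vsw mode=sc net-dev=e1000g1 priv-vsw0 primary
        ldm add-vnet priv-vnet0 priv-vsw0 guest1

For unshared, file-backed virtual disks (for example, a guest root disk backed by a file in the I/O domain), the synchronous-write setting mentioned above goes into /etc/system in the exporting I/O domain (a reboot of that domain is required; this is current as of July 2008, so check the LDoms release notes for your version):

        * /etc/system in the I/O domain exporting the file-backed disk
        set vds:vd_file_write_flags = 0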

    Cheers!

    Ashutosh Tripathi
    Solaris Cluster Engineering

    Comments:

    [Trackback] It's finally done! As of today, LDoms are supported as nodes in Sun Cluster. Yes, that's right, guest domains. The details are in a blog post by Ashutosh Tripathi and in the release notes of the current...

    Posted by Die Kernspalter on July 14, 2008 at 03:34 PM PDT #

    Any idea when a cluster service for LDoms will be out? I.e., with clustered I/O domains, when one dies it moves the LDoms across, etc.

    Posted by kangcool on July 19, 2008 at 09:54 AM PDT #

    Hi kang,

    There is a technical issue with the death of I/O domains. The guest domains keep running, just blocked on I/O; when the I/O domain comes back, the guest domains continue. So, if we fail over all the guest domains when the I/O domain dies, there is potential for problems. We are looking to solve that technical issue.

    It would help us if you described the deployment problem you are trying to solve. Would running SC inside the guest domains work for your deployment?

    Regards,
    -ashu

    Posted by ashu on July 24, 2008 at 01:55 AM PDT #

    Just thinking of virtual hosting.

    I.e., here's an image that you think is a real box, do what you want with it, etc...

    But what if the box "blows up"? You lose all your LDom guests. Joy.

    OK, you can restart each manually, but that could be a pain, so why not get software to do it?

    VMware now has this feature.

    Posted by kangcool on July 24, 2008 at 07:36 AM PDT #

    Hi kang,

    Yes, that makes sense. But we have to differentiate between a box actually "blowing up" (as you put it) vs. the box (or, to be specific, the I/O domains on the box) just going down for a reboot.

    Having said that, I do see the point that failing over the guest domains themselves has value. I did mention that technical issue in my earlier comment. We would have to see what the best way is to go about dealing with that.

    Thanks for your feedback,
    -ashu

    Posted by ashu on July 25, 2008 at 10:09 AM PDT #

    This is a topic I'm very interested in as well. Has any progress been made in determining whether or not this will be an offering? Is it on the roadmap and, if so, what is the forecasted timeframe for availability? Failover of domains would be just as beneficial as failover of containers.

    Posted by Jeff on September 08, 2008 at 07:26 AM PDT #

    Folks,

    To anyone who reads Jeff's comments above and wonders where the response is... Since Jeff's questions relate to product roadmaps and timelines, we are discussing this privately with some Sun folks in the loop.

    So, no... we don't ignore user comments here on the Cluster Oasis!! :-) :-)

    Regards,
    -ashu

    Posted by ashu on September 09, 2008 at 08:20 AM PDT #

    Just for clarification: I have a single T6320 with two guest domains and one I/O domain. Is it supported for me to map the same physical device(s) into each guest LDom? Or is this the fencing issue? This will be the only cluster on this box, and the two guest LDoms will be the only servers sharing the back-end disk.
    Thanks,
    Kyle

    Posted by Kyle on September 15, 2008 at 01:52 AM PDT #

    Hi Kyle,

    If you are running the SC software inside the two guest LDoms, then indeed there is a restriction on this configuration currently: you cannot share the same LUN across two guest domains on the same box.

    You should check out the SC release notes at http://wikis.sun.com/display/SunCluster/Sun+Cluster+3.2+2-08+Release+Notes#SunCluster3.22-08ReleaseNotes-optguestdomain

    See the section entitled "SPARC: Guidelines for Logical Domains in a Cluster", second bullet, which talks about "Fencing". It explicitly notes that this configuration is not supported because, currently, there is no supported way to disable fencing.

    Regards,
    -ashu

    Posted by ashu on September 15, 2008 at 06:14 AM PDT #

    Hi, I am new to LDoms but have 3 T5440 servers and have worked out how to set up LDoms. My problem is that I am also using HP IVM on IA64 machines, which is HP's virtualisation technology. With IVM I can set up 2 virtual machines and a virtual shared device on one IVM server; I change the parameters on the virtual file I created, telling it that it is sharable, and then I can mount it on both IVMs. Then I install Serviceguard on both virtual nodes and, Bob's your uncle, I can configure a completely virtual cluster that I can install OVO on; this all works just like a real cluster. Is this going to be possible with LDoms and SC in the future, or is it possible now? I have the latest SC 3.2 bits, however I cannot work out how to mount a virtual disk file on both LDoms. So it appears that LDoms and SC do not work the same way as HP IVM. Am I correct in my assumption, or am I just missing some commands to share the virtual file?

    Cheers

    Steve

    Posted by Steve Luther on July 24, 2009 at 12:39 AM PDT #

    Hi Steve,

    It looks like the basic issue you have here is sharing a virtual disk across multiple LDoms on different servers. The first thing to note is that with SC, the only type of virtual disk supported as shared storage is a complete LUN (so no file-based virtual disks, for example).

    So basically you use the "ldm add-vds" and "ldm add-vdsdev" commands in the control domains on all servers to export the same disk (on shared storage) to guest LDoms on multiple servers. These guest domains would be running the SC software.
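    A hypothetical sketch of what that looks like (the volume, service, and domain names and the disk path are made up; run the equivalent in each server's control domain, pointing at the same LUN on the shared array):

        # On server 1 and server 2, each in its own control domain:
        ldm add-vds primary-vds0 primary
        ldm add-vdsdev /dev/dsk/c4t600A0B80002A384Ad0s2 shared1@primary-vds0
        ldm add-vdisk shared1 shared1@primary-vds0 guestA   # guestB on server 2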

    Did that help, or were you looking for more specific syntax? Check out the man page of the ldm command, which is very well written.

    HTH,
    -ashu

    Posted by ashu on July 24, 2009 at 04:56 AM PDT #

    Hi Ashu,

    Thanks for replying. It did sort of clear up the matter of using virtual files as shared storage, but it is not really what I wanted to hear. No, I don't need the actual syntax for it; I have already done that bit. So it appears that the Sun LDoms stuff is not as good as the HP IVM stuff; do you know if this will be changed in the future? Sun would have a really good virtualisation solution if it worked similarly to HP's IVM by allowing a shared virtual file, rather than having to mount a LUN, which takes away from the complete virtualisation solution.

    One more question: I have created a 40GB file which I mounted on an LDom and installed from my JumpStart server, and this all works fine. Is it possible for me to unmount that 40GB file, copy it, mount it on another LDom, and get it working just by changing the hostname and IP address once it is booted? I can do this with HP IVM virtual files. If this is possible it will certainly be an advantage. If you haven't already guessed, I am an IVM administrator for HP OpenView, but as I use both IVM and LDoms every day it would be nice if they were both capable of doing similar things.

    Cheers

    Steve

    Posted by Steve Luther on July 24, 2009 at 06:50 PM PDT #

    Hi Steve,

    On the second question: yes, certainly you can copy the image file and reuse it after changing the host information. As a matter of fact, if your original image was on a ZFS zvol, you could use the zfs clone operation to do this even better (avoiding copying the blocks which would be the same in the two images, which would be a LOT of blocks given that we are talking about an OS image).
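    A quick hypothetical sketch of that zvol approach (the pool, volume, and domain names are made up):

        # Snapshot the installed guest's boot zvol, then clone it for a
        # second guest; only blocks that later diverge consume new space.
        zfs snapshot ldompool/guest1-boot@golden
        zfs clone ldompool/guest1-boot@golden ldompool/guest2-boot
        ldm add-vdsdev /dev/zvol/dsk/ldompool/guest2-boot guest2-boot@primary-vds0
        ldm add-vdisk guest2-boot guest2-boot@primary-vds0 guest2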

    On the first question: the restriction on the kind of shared devices comes from the Cluster's requirement to fence the shared devices. With SC32U2, you can try disabling fencing on the device (using the cldevice command, option default_fencing). That should work; if not, just holler.
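    For reference, a hypothetical invocation (the DID instance d4 is made up; confirm the exact property values in the cldevice(1CL) man page for your release):

        # Turn off fencing for one shared DID device, then verify.
        cldevice set -p default_fencing=nofencing d4
        cldevice show -v d4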

    And once your evaluation finishes, it would be great to hear about your experience.

    Hope that helps,
    -ashu

    Posted by ashu on July 27, 2009 at 07:27 AM PDT #

    Cheers Ashu,

    I will look into it and keep you updated as to my progress.

    Cheers
    Steve

    Posted by Steve Luther on July 27, 2009 at 06:38 PM PDT #

    Hi Ashu,

    One more quick question: the mkfile command creates files with -rw-------T file permissions. What is the T for, and how can I get it onto the copied virtual file?

    Cheers

    Steve

    Posted by Steve Luther on August 03, 2009 at 07:08 PM PDT #

    Hi Steve,

    The sticky bit (T) is turned on by mkfile for historical reasons. Briefly: this is because mkfile was most often used to create temporary swap files. Turning on the sticky bit on such files was reasonable because it meant that the OS would try to keep pages from the swap file in memory. With the current Solaris implementation of the virtual memory subsystem, this is no longer necessary or useful. mkfile, of course, is used nowadays for many varied purposes. I don't believe the Solaris virtual memory implementation even looks at this flag anymore, but I did not actually double-check that.

    So, I would say don't worry about it. But if you must, you can always do "chmod +t <file>" on the copied file. Since the execute permission for others is off, this is the magic combination that turns on this flag (and makes it show up as T rather than t).
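    A tiny sketch (the image path is hypothetical):

        # Restore the sticky bit on the copied image file; with the
        # other-execute bit off, it displays as "T" in ls -l output.
        chmod +t /ldoms/images/guest2-boot.img
        ls -l /ldoms/images/guest2-boot.img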

    HTH,
    -ashu

    Posted by ashu on August 04, 2009 at 02:15 AM PDT #

    Hi Ashu,

    I recently set up an environment with Sun Cluster in Logical Domains and it works well. The stage where it failed was complete storage LUN failure on one cluster node under a Logical Domain; in other words, all LUN paths failed. The Logical Domain just hangs and does nothing. Is this because of the I/O control domain limitation mentioned above? I wanted it to reboot immediately when it finds that all paths to the vdisk have failed and, once rebooted, move all services back to the other node. Currently I have to manually reboot the box, although the services move over to the other node fine.
    Appreciate your early response.

    Thanks
    Vijay

    Posted by vijay upreti on August 05, 2009 at 04:33 AM PDT #

    Hi Vijay,

    Handling of storage failures by SC should be no different between regular Solaris nodes and LDoms.

    I presume you have looked at the scdpm(1M) man page and played with the settings? What did it report for the failed disk from the different nodes?

    Regards,
    -ashu

    Posted by ashu on August 05, 2009 at 05:04 AM PDT #

    Thanks Ashu

    Steve

    Posted by Steve Luther on August 06, 2009 at 09:10 PM PDT #

    Hi Ashu,

    Nope, I didn't play around with any settings. And what I can see is that with the LUN path failures, the Logical Domain itself hangs, even without the cluster. So it looks like some other weird LDom issue?

    Thanks
    Vijay

    Posted by Vijay Upreti on September 16, 2009 at 12:18 AM PDT #

    Hi Vijay,

    I believe the default behaviour for vdisks is to wait forever for the I/O to resume. This would show up as a "hang" if the failed vdisk happened to host the guest domain OS. Try specifying "timeout=30" on the add-vdisk command. That would mean the I/O would return with EIO after 30 seconds if the service domain is down. If the vdisk hosted the OS, you are still hosed; Solaris cannot continue and would most likely panic, but at least you get something tangible (a panic!) to look at, instead of sitting around wondering what is going on.
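    For example, a hypothetical add-vdisk invocation with that timeout (the disk, volume, and domain names are made up):

        # Return EIO to the guest after 30 seconds instead of blocking
        # forever when the service domain providing the backend is down.
        ldm add-vdisk timeout=30 bootdisk guest1-boot@primary-vds0 guest1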

    So much for the virtual disk behaviour itself. Things get even more interesting if we are talking about a cluster and failures of shared disks. Here SC's disk path monitoring framework comes into play. Once you have played with the vdisk timeouts a bit and understand the behaviour, I encourage you to experiment with cluster disk path monitoring. Take a look at "man -M /usr/cluster/man scdpm" to get an idea of what you can do with the feature.
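    As an illustration, a couple of scdpm invocations (the node name and DID path are hypothetical; see the man page for the full syntax):

        # Print the status of all monitored disk paths on all nodes.
        /usr/cluster/bin/scdpm -p all:all

        # Monitor a specific disk path from a specific node.
        /usr/cluster/bin/scdpm -m phys-node1:/dev/did/rdsk/d4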

    HTH,
    -ashu

    Posted by ashu on September 16, 2009 at 08:12 AM PDT #

    Hi Ashu,

    Strange question, I know, but can I switch off multithreading on a T5440?

    The reason being, we need to do some performance tests, and the fastest machine I have by far is a T5440 with 64GB RAM and 4 x 1.2GHz CPUs.

    We ran a quick performance test gzipping a file, which is single-threaded, and timed it; we ran the same test on an E4500 with 3.5GB RAM and 6 x 400MHz CPUs, and the E4500 was quicker. We ran similar tests on HP-UX machines and the results were 10 times faster. I set up an LDom on the T5440 with 4 vCPUs and 16GB RAM, and the test was 10 seconds quicker than it was on the host T5440 with no LDoms software configured. For some reason Oracle also did not like the 256 CPUs it found in the machine, and our own messaging software was incredibly slow on the T5440. Do you have any ideas what could be going wrong? After all, a 90-grand T5440 should run rings around an old E4500.

    Cheers

    Steve

    Posted by Steve Luther on December 08, 2009 at 09:51 PM PST #

    Hi Steve,

    Performance is a tricky question, and so I may have to consult with others a bit. I can only mention in broad strokes here that the CMT machines really are targeted towards multi-threaded workloads, so single-thread performance comparisons are not that interesting.

    So, to me, the really interesting observation is that somehow, by installing/configuring LDoms, your system got quicker? Perhaps the "10 seconds quicker" that you mention is within the error margin? What percentage is that?

    I'm not sure I understood the point about "Oracle not liking the 256 CPUs it found"; could you say a bit more about what sort of issue you saw? And while we are at it, can you mention the LDoms version, and the OS versions in the guest and the I/O domains?

    I might be able to make slightly more focused comments after consulting with other folks, once I have some more understanding of what you are seeing.

    Regards,
    -ashu

    Posted by ashu on December 09, 2009 at 04:49 AM PST #

    Hi ALL,

    I am looking for a Sun Blueprint for implementing two two-node clusters across two 5120s. Can you please point me in the right direction?

    Thanks
    Raj

    Cluster A: LDom 1 (node A1) from the FIRST 5120 and LDom 1 (node A2) from the SECOND 5120

    Cluster B: LDom 2 (node B1) from the FIRST 5120 and LDom 2 (node B2) from the SECOND 5120

    Each 5120 has the following resources:
    e1000g1, e1000g2, e1000g3, e1000g4, e1000g5, e1000g6
    2 x dual-channel HBAs

    Posted by Rajesh Bhabaraju on February 15, 2010 at 05:26 AM PST #

    I'm having trouble getting LDoms 1.3 with Sun Cluster 3.2 up. I keep getting the error "Cannot register devices as HA" and can't find any information about it.

    Any thoughts?

    I use EMC PowerPath and have exported both the boot disk and quorum disks (separate disks) as /dev/dsk/emcpowerXc (which is the backup partition).
    Thanks

    Posted by Murali Nagulakonda on February 16, 2010 at 09:03 PM PST #

    @Rajesh,

    Start with:

    http://www.sun.com/software/solaris/pdf/cluster_ldoms_final_hi_res.pdf

    It talks about creating just one two-node cluster, but creating another one is easy: just create the two guest domains (one on each server) and install the cluster software in them. You can reuse the same cluster private interconnect for the second cluster as well, but make sure that the private network for the second cluster is DIFFERENT. In other words, while installing the second cluster, do not accept the default private network; instead choose another one. Everything else should be straightforward. [PS: this point is noted in the documentation mentioned in the blog.]

    Post here if you run into anything, or if you succeed on the first try.

    HTH,
    -ashu

    Posted by ashu on February 26, 2010 at 04:48 AM PST #

    If you're using EMC PowerPath devices, you probably don't want to do it on LDoms, because EMC PowerPath and LDoms don't play very nicely with Sun Cluster reservations, is what I heard.

    I'm having trouble getting Sun HA-NFS on LDoms working, even though Sun Cluster itself works fine (I haven't tried killing a path to the storage, though).
    MPxIO may be fine, though.

    Posted by Murali Nagulakonda on February 26, 2010 at 04:54 AM PST #

    @Murali,

    From what I can tell, you are exporting the EMC devices correctly, and those should show up in the cluster just fine. EMC PowerPath can be tricky at the best of times; when you throw virtualization at it, it is best to raise a support call instead of debugging the problem in a blog! :-) :-)

    HTH,
    -ashu

    Posted by ashu on February 26, 2010 at 04:56 AM PST #

    @Ashu,

    It turned out that EMC PowerPath devices can't be exported directly to the LDom for Sun Cluster to create DID devices.
    I had to export the underlying cXtXdXs2 device via a vdsdev, creating an mpgroup for the different paths, and then give that disk to the LDom.

    That is the only way Sun Cluster picked up the shared device. Otherwise it was not able to register devices and would give me CCR failures.

    Posted by Murali Nagulakonda on February 26, 2010 at 05:01 AM PST #

    Hi Murali,

    What doesn't work with cluster reservations is LDoms' own disk multipathing solution, called "vdisk failover", enabled by the "mpgroup" option (on the add-vdsdev command). That is a very different solution, intended to work with multiple I/O domains on the server.

    AFAIK, if you are using a storage multipathing solution in the control domain and exporting the PowerPath device, the LDoms vdisk/vds drivers work fine as long as you have the proper Solaris fixes in the control domain and the guest domain.

    As I mentioned above, your problem is a bit too specific to debug in a blog. I suggest you engage the support folks and let them analyze it. They will ask you for any additional data they need.

    HTH,
    -ashu

    Posted by ashu on February 26, 2010 at 05:13 AM PST #

    @Ashu
    Interesting. It was the Sun Cluster support guy who told me to back off from the PowerPath device and actually export the cXtXdX device in an mpgroup, which got the cluster working.

    This is troubling now. If the PowerPath device should work, it should work. At first he thought it was a Solaris 11 bug that leaked into Solaris 10, but then he backed off and told me the cXtXdX solution.

    I found an article from EMC that said PowerPath 5.2 and higher is supported on LDoms Sun Clusters.

    This is troubling now. Can I trust the Sun Cluster guy?

    Posted by Murali Nagulakonda on February 26, 2010 at 05:17 AM PST #

    Hi Murali,

    Looks like our last posts crossed in the ether! :-)

    At this point in time, I would suggest that you keep an open mind. While it is true that mpgroups don't work with the cluster reservations, perhaps in your configuration they don't have to, because you also have PowerPath in the stack, and maybe it takes responsibility for moving the reservations around in case of disk path failures.

    The above is just speculation, of course. I suggest that you have a very, very careful validation plan in place which makes sure that the whole configuration works correctly in the presence of failures: both transitory and permanent disk path failures, as well as reboots of the control domain. This validation should be performed while there is lots of active I/O going on, and in the presence of cluster reconfigurations (which interact with the reservations).

    To repeat, please keep working with the Sun support people and bring your concerns to them, but please keep an open mind about the best way to deploy these technologies together.

    HTH,
    -ashu

    Posted by ashu on February 26, 2010 at 07:58 AM PST #
