Friday Dec 11, 2015

New resource type version for Solaris Cluster SUNW.ldom agent

In one of my previous blogs I described:
"Configure failover LDom (Oracle VM Server for SPARC) on Solaris Cluster 4.1 by using 'live' migration"

The setup of failover LDom with 'live' migration is still the same in Solaris Cluster 4.2 and the recently released Solaris Cluster 4.3, except for one significant change: how the password for non-interactive 'live' migration is set up with SUNW.ldom:6 and higher.

  • The Solaris Cluster 4.2 SRU4.1 delivers resource type version 6 of SUNW.ldom.
  • The Solaris Cluster 4.3 delivers resource type version 7 of SUNW.ldom.
  • The method described in
    Configure failover LDom (Oracle VM Server for SPARC) on Solaris Cluster 4.1 by using 'live' migration applies only to SUNW.ldom:5 or lower.

    To check the used version of SUNW.ldom do:
    # clrt list | grep ldom
    SUNW.ldom:7


    Therefore this blog describes:

  • I) Configure failover LDom (Oracle VM Server for SPARC) on Solaris Cluster with SUNW.ldom:6 or higher by using 'live' migration

  • II) How to upgrade from SUNW.ldom:5 or lower to SUNW.ldom:6 or higher



    Let's start:
    I) Configure failover LDom (Oracle VM Server for SPARC) on Solaris Cluster with SUNW.ldom:6 or higher by using 'live' migration
    A) Start with
    Configure failover LDom (Oracle VM Server for SPARC) on Solaris Cluster 4.1 by using 'live' migration
    until step 16.

    B) Instead of step 17 and step 18 of the above document do:

    17) Set up the password for non-interactive 'live' migration on one primary node:
        # clpstring create -t resource -b fgd0-rs fldom-rg_fgd0-rs_ldompasswd
        Enter string value:
        Enter string value again:

    This new approach is also described in the new section
    SPARC: How to Configure HA for Oracle VM Server of the Oracle Solaris Cluster Data Service for Oracle VM Server for SPARC Guide.
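
    To verify that the private string has been stored (an optional quick check; the output format may differ slightly between releases, and the password value itself is never displayed):
        # clpstring show
        Pstring Name:          fldom-rg_fgd0-rs_ldompasswd
         Object Instance:          fgd0-rs
         Object Type:                resource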

    18) Create SUNW.ldom resource:
    # clrs create -g fldom-rg -t SUNW.ldom -p Domain_name=fgd0 -p resource_dependencies=fgd0-has-rs fgd0-rs

    C) Continue with step 19 of
    Configure failover LDom (Oracle VM Server for SPARC) on Solaris Cluster 4.1 by using 'live' migration



    II) How to upgrade from SUNW.ldom:5 or lower to SUNW.ldom:6 or higher
    The upgrade is quite easy and can be done while the failover LDom is running.
    A) Check current setup:
        # scrgadm -pvv | grep Password_file | grep value
           (fgldom-rg:fgd0-rs:Password_file) Res property value: /var/cluster/.pwfgldom
        # clrt list | grep ldom
        SUNW.ldom:5

    From the outputs we can identify
    the resource group (RG) name ‘fgldom-rg’
    the resource (RS) name ‘fgd0-rs’
    the resource type (RT) version 5 of SUNW.ldom
    the used password file /var/cluster/.pwfgldom
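
    You can also confirm the resource's current Type_version directly, using the same show/grep pattern used later in this blog (an optional cross-check; it should report 5 at this point):
        # clrs show -v fgd0-rs | grep Type_version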

    B) Create the new private string for the password (named RG_RS_ldompasswd, matching the resource group and resource identified above):
        # clpstring create -t resource -b fgd0-rs fgldom-rg_fgd0-rs_ldompasswd
        Enter string value:
        Enter string value again:
        # clpstring show
        Pstring Name:          fgldom-rg_fgd0-rs_ldompasswd
         Object Instance:          fgd0-rs
         Object Type:                resource

    C) Verify available RT version:
        # grep RT_version /opt/SUNWscxvm/etc/SUNW.ldom
        RT_version ="6";

    D) Upgrade the RT
         1) Register version 6:
            # clrt register SUNW.ldom:6
            # clrt list | grep ldom
            SUNW.ldom:5
            SUNW.ldom:6
         2) Update failover LDom resource properties with new values:
            # clrs set -p Type_version=6 fgd0-rs
            # clrs set -p Password_file="" fgd0-rs

    E) Optionally, test a switchover:
        # clrg switch -n <node2> fgldom-rg
        # clrg switch -n <node1> fgldom-rg
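
        After each switch it is worth confirming that the resource group and the LDom resource are online on the expected node (an optional check, not part of the original upgrade steps):
        # clrg status fgldom-rg
        # clrs status fgd0-rs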

    F) Finally, clean up the system
         1) Remove the password file /var/cluster/.pwfgldom on all primary nodes
         2) Unregister old RT version
            # clrt unregister SUNW.ldom:5
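
            To confirm that only the new resource type version remains registered (the version shown depends on your release, e.g. 6 or 7):
            # clrt list | grep ldom
            SUNW.ldom:6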


    In the Solaris Cluster 4.2.4.1.0 (SRU 4.1) ReadMe (Doc ID 2013413.1) this is described as:
    ###########################
    Note 5: Creating a new SUNW.ldom resource type

    To create a new resource of type SUNW.ldom:6, first store the password required for HA-ldom migration in a private string named RG_RS_ldompasswd, where RG and RS are the resource group name and resource name:

    # /usr/cluster/bin/clpstring create -t resource -b RS RG_RS_ldompasswd

    To upgrade an existing resource of type SUNW.ldom:5 or earlier, execute the following commands where RS is the resource name:

    # /usr/cluster/bin/clrs set -p Type_version=6 RS
    # /usr/cluster/bin/clrs set -p Password_file="" RS

    The Password_file extension property is obsolete for new resources as of type SUNW.ldom:6. If you upgrade an existing resource to SUNW.ldom:6, do not delete the old password files until after you have completed the RT upgrade process described above.
    ###########################

    For example, if you try to use the old-style password file (such as Password_file=/var/cluster/.pwfgd0) with SUNW.ldom:6 or higher, it will fail with:
    # clrs create -g fldom-rg -t SUNW.ldom -p Domain_name=fgd0 -p Password_file=/var/cluster/.pwfgd0 -p resource_dependencies=fgd0-has-rs fgd0-rs
    clrs: node1 - Password_file property is obsolete for RT version>5, use private string instead.
    clrs: (C189917) VALIDATE on resource fgd0-rs, resource group fldom-rg, exited with non-zero exit status.
    clrs: (C720144) Validation of resource fgd0-rs in resource group fldom-rg on node node1 failed.
    clrs: (C891200) Failed to create resource "fgd0-rs".

    In the /var/adm/messages file:
    node1 SC[SUNWscxvm.validate]:fldom-rg:fgd0-rs: Password_file property is obsolete for RT version>5, use private string instead.
    node1 Cluster.RGM.global.rgmd: VALIDATE failed on resource <fgd0-rs>, resource group <fldom-rg>, time used: 0% of timeout <300, seconds>



    In summary: Up to SUNW.ldom:5, use the Password_file property. With SUNW.ldom:6 and higher, use the clpstring command to set up the password for failover LDom 'live' migration.

    Tuesday Apr 22, 2014

    Configure failover LDom (Oracle VM Server for SPARC) on Solaris Cluster 4.1 by using 'live' migration

    This blog shows an example of how to configure 'Oracle Solaris Cluster Data Service for Oracle VM Server for SPARC' on Solaris Cluster 4.1. It also mentions some hints around such a configuration. For this setup, Solaris Cluster 4.1 SRU3 or higher and Oracle VM Server 3.0 or higher are required.
    Essentially, this is a summary of
    Oracle Solaris Cluster Data Service for Oracle VM Server for SPARC Guide
    and
    Oracle VM Server for SPARC 3.0 Administration Guide.
    Please check these guides for further restrictions and requirements.

    This procedure is specifically for 'live' migration of guest LDoms, which means the OS in the LDom is not shut down during the failover. In earlier OVM releases this was called 'warm' migration; however, the term 'live' is used in this example. A 'cold' migration means that the OS in the guest LDom is stopped before migration.

    Let's start:
    The necessary services must be identical on all the potential control domains (primary domains) which run as Solaris Cluster 4.1 nodes. It is expected that Oracle VM Server is already installed.

    1) Prepare all primary domains which should manage the failover LDom with the necessary services.
    all_primaries# ldm add-vconscon port-range=5000-5100 primary-vcc0 primary
    all_primaries# svcadm enable svc:/ldoms/vntsd:default
    all_primaries# ldm add-vswitch net-dev=net0 public-vsw1 primary
    all_primaries# ldm add-vdiskserver primary-vds0 primary

    To verify:
    all_primaries# ldm list-bindings primary


    2) Set failure-policy on all primary domains:
    all_primaries# ldm set-domain failure-policy=reset primary
    To verify:
    all_primaries# ldm list -o domain primary


    3) Create failover guest domain (fgd0) on one primary domain.
    Simple example:
    primaryA# ldm add-domain fgd0
    primaryA# ldm set-vcpu 16 fgd0
    primaryA# ldm set-mem 8G fgd0



    4) Add public network to failover guest domain:
    primaryA# ldm add-vnet public-net0 public-vsw1 fgd0
    To verify:
    primaryA# ldm list-bindings fgd0

    For more details on setting up guest LDoms, refer to the Oracle VM Server for SPARC 3.0 Administration Guide.


    5) Set necessary values on failover guest domain fgd0:
    primaryA# ldm set-domain master=primary fgd0
    primaryA# ldm set-var auto-boot?=false fgd0

    To verify run:
    primaryA# ldm list -o domain fgd0
    auto-boot?=false is a "must have" to prevent data corruption. More details are available in DocID 1585422.1 Solaris Cluster and HA VM Server Agent SUNW.ldom Data Corruption may Occur in a failover Guest Domain when "auto-boot?=true" is set
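
    If you want to double-check the variable itself, the 'ldm list-variable' subcommand should show it (assuming your OVM Server for SPARC release provides this subcommand):
    primaryA# ldm list-variable auto-boot? fgd0
    auto-boot?=false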


    6) Select the boot device for failover guest domain fgd0.
    Possible options for the root file system of a domain with 'live' migration are: Solaris Cluster global filesystem (UFS/SVM), NFS, iSCSI, and SAN LUNs, because all of them are accessible at the same time from both nodes. The recommendation is to use a full raw disk when doing 'live' migration. The full raw disk can be provided via SAN or iSCSI to all primary domains.
    Remember, ZFS as root file system can ONLY be used when doing 'cold' migration, because 'live' migration requires both nodes to access the root file system at the same time, which is not possible with ZFS.
    Using a Solaris Cluster global filesystem is an alternative, but the performance is not as good as root on a raw disk.
    Details are available in DocID 1366967.1 Solaris Cluster Root Filesystem Configurations for a Guest LDom Controlled by a SUNW.ldom Resource

    So, root on a raw disk is selected.
    Add boot device to fgd0:
    all_primaries# ldm add-vdsdev /dev/did/rdsk/d7s2 boot_fgd0@primary-vds0
    primaryA# ldm add-vdisk root_fgd0 boot_fgd0@primary-vds0 fgd0
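
    To verify that the virtual disk server device and the virtual disk were added as expected (an optional quick check):
    all_primaries# ldm list-services primary
    primaryA# ldm list-bindings fgd0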



    6a) Optional: Configure the MAC addresses of the LDom. The LDom Manager assigns MAC addresses automatically, but the following issues can occur:
    * Duplicate MAC addresses if other guest LDoms are down when creating a new LDom.
    * The MAC address can change after failover of an LDom.
    Assigning your own MAC addresses is possible. This example uses the suggested range between 00:14:4F:FC:00:00 and 00:14:4F:FF:FF:FF as described in
    Assigning MAC Addresses Automatically or Manually of the Oracle VM Server for SPARC 3.0 Administration Guide.
    Example:
    Identify the currently auto-assigned MAC addresses:
    primaryA# ldm list -l fgd0
    To see the HOSTID (which is derived from the MAC address), an 'ldm bind fgd0' is necessary; unbind fgd0 afterwards with 'ldm unbind fgd0'.
    MAC: 00:14:4f:fb:50:dc → change to 00:14:4f:fc:50:dc
    HOSTID: 0x84fb50dc → change to 0x84fb50dd
    public-net: 00:14:4f:fa:01:49 → change to 00:14:4f:fc:01:49
    primaryA# ldm set-domain mac-addr=00:14:4f:fc:50:dc fgd0
    primaryA# ldm set-domain hostid=0x84fb50dd fgd0
    primaryA# ldm set-vnet mac-addr=00:14:4f:fc:01:49 public-net0 fgd0
    primaryA# ldm list-constraints fgd0 (this now shows the assigned MAC addresses)
    For more details, or if you need to change the MAC addresses on an already configured failover guest LDom, refer to DocID 1559415.1 Solaris Cluster HA-LDom Agent do not Preserve hostid and MAC Address Upon Failover


    7) Bind and start fgd0:
    primaryA# ldm bind fgd0
    primaryA# ldm start fgd0


    8) Login to LDom using console:
    primaryA# telnet localhost 5000


    9) Install Solaris 10 or Solaris 11 on the LDom by using an install server.
    To identify the MAC address of the LDom, run the following in the console of fgd0:
    {0} ok devalias net
    {0} ok cd /virtual-devices@100/channel-devices@200/network@0
    {0} ok .properties
    local-mac-address 00 14 4f fc 01 49

    For other installation methods please refer to Installing Oracle Solaris OS on a Guest Domain of the Oracle VM Server for SPARC 3.0 Administration Guide.


    10) Install HA-LDom (HA for Oracle VM Server Package) on all primary domain nodes if not already done
    all_primaries# pkg info ha-cluster/data-service/ha-ldom
    all_primaries# pkg install ha-cluster/data-service/ha-ldom



    11) Check that 'cluster' is the first entry in /etc/nsswitch.conf:
    all_primaries# svccfg -s name-service/switch listprop config/host
    config/host astring "files dns"
    all_primaries# svccfg -s name-service/switch listprop config/ipnodes
    config/ipnodes astring "files dns"
    all_primaries# svccfg -s name-service/switch listprop config/netmask
    config/netmask astring files
    If not, add it:
    all_primaries# svccfg -s name-service/switch setprop config/host = astring: '("cluster files dns")'
    all_primaries# svccfg -s name-service/switch setprop config/ipnodes = astring: '("cluster files dns")'
    all_primaries# svccfg -s name-service/switch setprop config/netmask = astring: '("cluster files")'
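
    After changing these properties, the switch service typically needs a refresh so that the new values are applied (on Solaris 11, /etc/nsswitch.conf is regenerated from the SMF configuration):
    all_primaries# svcadm refresh svc:/system/name-service/switch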

    More Details in DocID 1554887.1 Solaris Cluster: HA LDom Migration Fails With "Failed to establish connection with ldmd(1m) on target"


    12) Create the resource group for failover LDom fgd0 on the primary domains:
    primaryA# clrg create -n primaryA,primaryB fldom-rg


    13) Register SUNW.HAStoragePlus if not already done:
    primaryA# clrt register SUNW.HAStoragePlus


    14) Create HAStoragePlus resource for boot device:
    primaryA# clrs create -g fldom-rg -t SUNW.HAStoragePlus -p GlobalDevicePaths=/dev/global/dsk/d7s2 fgd0-has-rs
    Using d7s2 (slice 2, which represents the full disk) is a requirement!


    15) Enable the LDom resource group on the current node:
    primaryA# clrg online -M -n primaryA fldom-rg
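
    To confirm that the resource group is now online on primaryA (an optional check):
    primaryA# clrg status fldom-rg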


    16) Register SUNW.ldom
    primaryA# clrt register SUNW.ldom


    17) Set up the password file for non-interactive 'live' migration on all primary nodes:
    all_primaries# vi /var/cluster/.pwfgd0
    Add the root password to this file.
    all_primaries# chmod 400 /var/cluster/.pwfgd0
    Requirements:
    * The first line of the file must contain the password
    * The password must be plain text
    * The password must not exceed 256 characters in length
    A newline character at the end of the password and all lines that follow the first line are ignored.
    These details are from Performing Non-Interactive Migrations of the Oracle VM Server for SPARC 3.0 Administration Guide.


    Attention: If you are using SUNW.ldom:6 or higher, this kind of password setup will fail. The alternative in 17a) also does not work with SUNW.ldom:6 or higher. For details please refer to my blog
    New resource type version for Solaris Cluster SUNW.ldom agent


    17a) Alternative: Set up an encrypted password file for non-interactive 'live' migration on all primary nodes:
    all_primaries# echo "encrypted" > /var/cluster/.pwfgd0
    all_primaries# dd if=/dev/urandom of=/var/cluster/ldom_key bs=16 count=1
    all_primaries# chmod 400 /var/cluster/ldom_key
    all_primaries# echo fu_bar | /usr/sfw/bin/openssl enc -aes128 -e -pass file:/var/cluster/ldom_key -out /opt/SUNWscxvm/.fgd0_passwd
    all_primaries# chmod 400 /opt/SUNWscxvm/.fgd0_passwd

    The root password for the failover LDom is "fu_bar", which will be encrypted. All files must be secured using "chmod 400". Neither /var/cluster/ldom_key nor the /opt/SUNWscxvm/.{DOMAIN}_passwd file can be placed in a different location or given a different name.

    Verify that the encrypted password can be decrypted:
    all_primaries# /usr/sfw/bin/openssl enc -aes128 -d -pass file:/var/cluster/ldom_key -in /opt/SUNWscxvm/.fgd0_passwd
    More Details in DocID 1668567.1 Solaris Cluster HA-LDom Fails Doing 'live' Migration with "normal failover will be performed" or "Password cannot be longer than 256 characters" due to Wrong Value in 'Password_file' resource property


    18) Create SUNW.ldom resource
    primaryA# clrs create -g fldom-rg -t SUNW.ldom -p Domain_name=fgd0 -p Password_file=/var/cluster/.pwfgd0 -p resource_dependencies=fgd0-has-rs fgd0-rs

    Notice: Solaris Cluster retrieves the domain configuration with the "ldm list-constraints -x <ldom>" command and stores it in the CCR. This information is used to create or destroy the domain on the node where the resource group is brought online or offline.
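
    If you want to inspect exactly which configuration gets captured, you can run the same command manually; redirecting the XML output to a file (any path will do, /var/tmp/fgd0-constraints.xml is just an example) makes it easier to read:
    primaryA# ldm list-constraints -x fgd0 > /var/tmp/fgd0-constraints.xml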


    19) Check Migration_type property. It should be MIGRATE for 'live' migration:
    primaryA# clrs show -v fgd0-rs | grep Migration_type
    If not MIGRATE then set it:
    primaryA# clrs set -p Migration_type=MIGRATE fgd0-rs


    20) To stop/start the SUNW.ldom resource
    primaryA# clrs disable fgd0-rs
    primaryA# clrs enable fgd0-rs


    21) Verify the setup by switching the failover LDom to the other node and back.
    primaryA# clrg switch -n primaryB fldom-rg
    primaryA# clrg switch -n primaryA fldom-rg
    To monitor the migration process, run 'ldm list -o status fgd0' on the target primary domain.


    22) Tune your timeout values depending on your system.
    primaryA# clrs set -p STOP_TIMEOUT=1200 fgd0-rs
    Details in DocID 1423937.1 Solaris Cluster: HA LDOM Migration Fails With "Migration of domain timed out, the domain state is now shut off"
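
    To review the current timeout values before and after tuning (an optional check, using the same show/grep pattern as above):
    primaryA# clrs show -v fgd0-rs | grep -i timeout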


    23) Consider further tuning of timeout values as described in
    SPARC: Tuning the HA for Oracle VM Server Fault Monitor of Oracle Solaris Cluster Data Service for Oracle VM Server for SPARC Guide
    For less frequent probing, settings like the following can be used:
    primaryA# clrs set -p Thorough_probe_interval=180 -p Probe_timeout=90 fgd0-rs

    Last but not least, it is not supported to run a Solaris Cluster of two or more nodes within a failover LDom!
    BUT with SC 4.1 SRU4 or higher you can run a single-node Solaris Cluster within a failover LDom.
    For details please refer to Application monitoring in Oracle VM for SPARC failover guest domain within Doc ID 1597319.1 Oracle Solaris Cluster Product Update Bulletin October 2013
