Monday Apr 06, 2009

Controlling [Virtual] Network Interfaces in a Non-Global Solaris Zone

In the software world, some tools, like SAP NetWeaver's Adaptive Computing Controller (ACC), require full control over a network interface so that they can bring the NICs up and down at will to fulfill their responsibilities. Those tools may function normally on Solaris 10 [and later] as long as they are run in the global zone. However, there might be some trouble when those tools are run in a non-global zone, especially on machines with only one physical network interface installed and when the non-global zones are created with the default configuration. This blog post suggests a few ways to get around those issues, so the tools can function the way they normally do in the global zone.

If the machine has only one NIC installed, there are at least two issues that will prevent tools like ACC from working in a non-global zone.

  1. Since there is only one network interface on the system, it is not possible to dedicate that interface to the non-global zone where ACC is supposed to run. Hence all the zones, including the global zone, must share the physical network interface.
  2. When the physical network interface is being shared across multiple zones, it is not possible to plumb/unplumb the network interface from a Shared-IP Non-Global Zone. Only the root user in the global zone can plumb/unplumb the lone physical network interface.
    • When a non-global zone is created with the default configuration, Shared-IP zone is created by default. Shared-IP zones have separate IP addresses, but share the IP routing configuration with the global zone.

Fortunately, Solaris has a solution to the aforementioned issues in the form of network virtualization. Crossbow is the code name for network virtualization in Solaris. Crossbow provides the necessary building blocks to virtualize a single physical network interface into multiple virtual network interfaces (VNICs) - so the solution to the issue at hand is to create a virtual network interface, and then create an Exclusive-IP Non-Global Zone using the virtual NIC. The rest of the blog post demonstrates the simple steps to create a VNIC and to configure a non-global zone as an Exclusive-IP Zone.

Create a Virtual Network Interface using Crossbow

  • Make sure the OS has Crossbow functionality
    
    global# cat /etc/release
                     Solaris Express Community Edition snv_111 SPARC
               Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
                            Use is subject to license terms.
                                 Assembled 23 March 2009
    
    

    Crossbow has been integrated into Solaris Express Community Edition (Nevada) build 105 - hence all Nevada builds starting with build 105 will have the Crossbow functionality. OpenSolaris 2009.06 and the next major update to Solaris 10 are expected to have the support for network virtualization out-of-the-box.

  • Check the existing zones and the available physical and virtual network interfaces.
    
    global# zoneadm list -cv
      ID NAME             STATUS     PATH                           BRAND    IP    
       0 global           running    /                              native   shared
    
    global# dladm show-link
    LINK        CLASS    MTU    STATE    OVER
    e1000g0     phys     1500   up       --
    
    

    In this example, there is only one NIC, e1000g0, on the server; and there are no non-global zones installed.

  • Create a virtual network interface based on device e1000g0 with an automatically generated MAC address. If the NIC has factory MAC addresses available, one of them will be used. Otherwise, a random address is selected. The auto mode is the default action if none is specified.
    
    global# dladm create-vnic -l e1000g0 vnic1
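
    If the application calls for a fixed, well-known MAC address instead of an auto-generated one, the -m option of dladm create-vnic can be used at creation time. The address and the VNIC name below are made-up values for illustration only, and the extra VNIC can be removed again with dladm delete-vnic:
    
    global# dladm create-vnic -l e1000g0 -m 2:8:20:aa:bb:cc vnic2
    global# dladm delete-vnic vnic2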
    
    
  • Check the available network interfaces one more time. Now you should be able to see the newly created virtual NIC in addition to the existing physical network interface. It is also possible to list only the virtual NICs.
    
    global# dladm show-link
    LINK        CLASS    MTU    STATE    OVER
    e1000g0     phys     1500   up       --
    vnic1       vnic     1500   up       e1000g0
    
    global# dladm show-vnic
    LINK         OVER         SPEED  MACADDRESS           MACADDRTYPE         VID
    vnic1        e1000g0      1000   2:8:20:32:9:10       random              0
    
    

Create a Non-Global Zone with the VNIC

  • Create an Exclusive-IP Non-Global Zone with the newly created VNIC being the primary network interface.
    
    global # mkdir -p /export/zones/sapacc
    global # chmod 700 /export/zones/sapacc
    
    global # zonecfg -z sapacc
    sapacc: No such zone configured
    Use 'create' to begin configuring a new zone.
    zonecfg:sapacc> create
    zonecfg:sapacc> set zonepath=/export/zones/sapacc
    zonecfg:sapacc> set autoboot=false
    zonecfg:sapacc> set ip-type=exclusive
    zonecfg:sapacc> add net
    zonecfg:sapacc:net> set physical=vnic1
    zonecfg:sapacc:net> end
    zonecfg:sapacc> verify
    zonecfg:sapacc> commit
    zonecfg:sapacc> exit
    
    global # zoneadm -z sapacc install
    
    global # zoneadm -z sapacc boot
    
    global #  zoneadm list -cv
      ID NAME             STATUS     PATH                           BRAND    IP    
       0 global           running    /                              native   shared
       1 sapacc           running    /export/zones/sapacc           native   excl
    
    
  • Configure the new non-global zone including the IP address and the network services
    
    global # zlogin -C -e [ sapacc
    ...
    
      > Confirm the following information.  If it is correct, press F2;             
        to change any information, press F4.                                        
                                                                                    
                                                                                    
                      Host name: sap-zone2
                     IP address: 10.6.227.134                                       
        System part of a subnet: Yes                                                
                        Netmask: 255.255.255.0                                      
                    Enable IPv6: No                                                 
                  Default Route: Detect one upon reboot                             
    
    
  • Inside the non-global zone, check the status of the VNIC and the status of the network service
    
    local# hostname
    sap-zone2
    
    local# zonename
    sapacc
    
    local# ifconfig -a
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000 
    vnic1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 10.6.227.134 netmask ffffff00 broadcast 10.6.227.255
            ether 2:8:20:32:9:10 
    lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
            inet6 ::1/128 
    
    local# svcs svc:/network/physical
    STATE          STIME    FMRI
    disabled       13:02:18 svc:/network/physical:nwam
    online         13:02:24 svc:/network/physical:default
    
    
  • Check the network connectivity.

    From inside the non-global zone to the outside world:

    
    local# ping -s sap29
    PING sap29: 56 data bytes
    64 bytes from sap29 (10.6.227.177): icmp_seq=0. time=0.680 ms
    64 bytes from sap29 (10.6.227.177): icmp_seq=1. time=0.452 ms
    64 bytes from sap29 (10.6.227.177): icmp_seq=2. time=0.561 ms
    64 bytes from sap29 (10.6.227.177): icmp_seq=3. time=0.616 ms
    ^C
    ----sap29 PING Statistics----
    4 packets transmitted, 4 packets received, 0% packet loss
    round-trip (ms)  min/avg/max/stddev = 0.452/0.577/0.680/0.097
    
    
    From the outside world to the non-global zone:
    
    remotehostonWAN# telnet sap-zone2
    Trying 10.6.227.134...
    Connected to sap-zone2.sun.com.
    Escape character is '^]'.
    login: test
    Password: 
    Sun Microsystems Inc.   SunOS 5.11      snv_111 November 2008
    
    -bash-3.2$ /usr/sbin/ifconfig -a
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000 
    vnic1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 10.6.227.134 netmask ffffff00 broadcast 10.6.227.255
    lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
            inet6 ::1/128 
    -bash-3.2$ exit
    logout
    Connection to sap-zone2 closed.
    
    

Dynamic [Re]Configuration of the [Virtual] Network Interface in a Non-Global Zone

  • Finally, try unplumbing and re-plumbing the virtual network interface from inside the Exclusive-IP Non-Global Zone
    
    global # zlogin -C -e [ sapacc
    [Connected to zone 'sapacc' console]
    ..
    
    zoneconsole# ifconfig vnic1 unplumb
    
    zoneconsole# /usr/sbin/ifconfig -a
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    
    zoneconsole# ifconfig vnic1 plumb
    
    zoneconsole# ifconfig vnic1 10.6.227.134 netmask 255.255.255.0 up
    
    zoneconsole# /usr/sbin/ifconfig -a
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    vnic1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 10.6.227.134 netmask ffffff00 broadcast 10.6.227.255
    lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
            inet6 ::1/128
    
    

As simple as that! Before we conclude, note that prior to Crossbow, Solaris system administrators had to resort to Virtual Local Area Networks (VLANs) to achieve similar outcomes.

Check the Zones and Containers FAQ if you are stuck in a strange situation or if you need some interesting ideas around virtualization on Solaris.

Wednesday Feb 18, 2009

Sun Blueprint : MySQL in Solaris Containers

With the cost of managing data centers full of under-utilized servers becoming a major concern, customers are actively looking for solutions to consolidate their workloads in order to:

  • improve server utilization
  • improve data center space utilization
  • reduce power and cooling requirements
  • lower capital and operating expenditures
  • reduce carbon footprint, and so on

To cater to those customers, Sun offers several virtualization technologies, such as Logical Domains, Solaris Containers, and xVM, free of cost on the SPARC and x86/x64 platforms.

In order to help our customers who are planning to consolidate their MySQL databases on systems running Solaris 10, we put together a document with the installation steps and the best practices for running MySQL inside a Solaris Container. Although the document focuses on the Solaris Containers technology, the majority of the tuning tips, including the ZFS tips, are applicable to all MySQL instances running [on Solaris] under different virtualization technologies.

You can access the blueprint document at the following location:

        Running MySQL Database in Solaris Containers

The blueprint document briefly explains the MySQL server & Solaris Containers technology, introduces different options to install MySQL server on Solaris 10, shows the steps involved in installing and running Solaris Zones & MySQL, and finally provides a few best practices to run MySQL optimally inside a Solaris Container.

Feel free to leave a comment if you notice any incorrect information, or if you have generic suggestions to improve documents like these.

Acknowledgments

Many thanks to Prashant Srinivasan, John David Duncan and Margaret B. for their help in different phases of this blueprint.

Saturday Feb 07, 2009

Mounting Windows' NTFS on [Open]Solaris x86/x64

The steps outlined in this blog post are derived from the Miscellaneous filesystem support for OpenSolaris on x86 web page. I just added a few examples to illustrate the steps to mount an NTFS partition that resides on an external hard drive (in this case, a Seagate FreeAgent external hard drive).

Step-by-Step instructions to mount NTFS filesystem on [Open]Solaris

  1. Install the packages FSWpart and FSWfsmisc, as sketched below.
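
    FSWpart and FSWfsmisc are delivered as regular SVR4 packages, so pkgadd can be used once they have been downloaded and unpacked. The exact file and directory names depend on what the download page offers, so treat the following strictly as a sketch:
    
    # pkgadd -d . FSWpart
    # pkgadd -d . FSWfsmisc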

  2. Find the logical device name for the NTFS partition. The -l option of the rmformat command lists all removable devices along with their device names.

    
    # rmformat -l   
    Looking for devices...
         1. Logical Node: /dev/rdsk/c1t0d0p0
            Physical Node: /pci@0,0/pci-ide@1f,1/ide@1/sd@0,0
            Connected Device: MATSHITA UJDA750 DVD/CDRW 1.60
            Device Type: DVD Reader
    	Bus: IDE
    	Size: 
    	Label: 
    	Access permissions: 
         2. Logical Node: /dev/rdsk/c2t0d0p0
            Physical Node: /pci@0,0/pci1179,1@1d,7/storage@1/disk@0,0
            Connected Device: Seagate  FreeAgentDesktop 100F
            Device Type: Removable
    	Bus: USB
    	Size: 953.9 GB
    	Label: 
    	Access permissions: 
    
    
  3. Identify the NTFS partition on the external disk with the help of fdisk.
    
    # fdisk /dev/rdsk/c2t0d0p0
                 Total disk size is 60800 cylinders
                 Cylinder size is 32130 (512 byte) blocks
    
                                                   Cylinders
          Partition   Status    Type          Start   End   Length    %
          =========   ======    ============  =====   ===   ======   ===
              1                 IFS: NTFS         0  60800    60801    100
    
    SELECT ONE OF THE FOLLOWING:
       1. Create a partition
       2. Specify the active partition
       3. Delete a partition
       4. Change between Solaris and Solaris2 Partition IDs
       5. Exit (update disk configuration and exit)
       6. Cancel (exit without updating disk configuration)
    Enter Selection: 6
    #
    
    

    In this example, partition #1, i.e., c2t0d0p1, has the NTFS filesystem.

  4. Mount the NTFS partition just like mounting a UFS filesystem with the mount command, using the argument ntfs for the -F option. Since the filesystem is mounted in a slightly unconventional way, use /usr/bin/xlsmounts to see the detailed mount table information.

    
    # mount -F ntfs /dev/dsk/c2t0d0p1 /mnt
    
    # /usr/bin/xlsmounts
      PHYSICAL DEVICE                 LOGICAL DEVICE      FS    PID         ADDR Mounted on
    /dev/dsk/c2t0d0p1              /dev/dsk/c2t0d0p1    ntfs   6755  127.0.0.1:/ /mnt
    
    # ls /mnt
    expForSun.dmp             MySQL5.1                   RECYCLER
    medium-64-bit             $RECYCLE.BIN               System Volume Information
    
    

    Notice the 127.0.0.1:/ under the ADDR column in the output of xlsmounts. The NTFS mount uses a userland NFSv2 server to access the filesystem on the raw partition. That is why the mount shows up as an NFS client mount from 127.0.0.1:/.

  5. To unmount the NTFS filesystem, use /usr/bin/xumount. The standard Solaris umount command unmounts the filesystem but does not terminate the background NFS server process.

    
    # /usr/bin/xumount /mnt
    
             - OR -
    
    # /usr/bin/xumount /dev/dsk/c2t0d0p1
    
    

Check the Miscellaneous filesystem support for OpenSolaris on x86 page and Moinak Ghosh's blog post Mount and Access NTFS and Ext2FS from Solaris x86 for the rest of the fine details.

Sunday Feb 01, 2009

Workaround to "eject of cdrom /dev/dsk/cxtxdxsx failed" Error on SXCE b106

Symptom

On Solaris Express Community Edition build 106, eject(1) fails intermittently (especially when blank media is inserted) with the error: eject of cdrom /dev/dsk/cxtxdxsx failed.


# cat /etc/release
                  Solaris Express Community Edition snv_106 X86
           Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                            Assembled 13 January 2009

# eject
eject of cdrom /dev/dsk/c1t0d0s2 failed: A security policy in place prevents this sender 
from sending this message to this recipient, see message bus configuration file (rejected 
message had interface "org.freedesktop.Hal.Device.Storage" member "Eject" error name 
"(unset)" destination "org.freedesktop.Hal")

Solution

Wait for the snv_107 build. Meanwhile, check bug #6791982.

Workaround
  1. Edit /etc/dbus-1/system.d/hal.conf. Add the following line to the <policy context="default"> section.
    <allow send_interface="org.freedesktop.Hal.Device.Storage"/>
  2. Restart the D-Bus message bus system service.
    svcadm restart svc:/system/dbus:default

    On a totally unrelated note, to see all the existing SMF services with a simple description, run:

    svcs -o FMRI,DESC
  3. Finally, try to eject the CD/DVD by typing the eject command (the whole sequence is consolidated below).
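
Put together, the workaround amounts to the following sequence. The hal.conf change is shown as a manual edit on purpose; the new allow line has to end up inside the <policy context="default"> element:

# vi /etc/dbus-1/system.d/hal.conf
    (insert <allow send_interface="org.freedesktop.Hal.Device.Storage"/> in the <policy context="default"> section)
# svcadm restart svc:/system/dbus:default
# eject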

Sunday Nov 30, 2008

PeopleSoft on Solaris 10: Fixing the "msgget: No space left on device" Error

(Crossposting the 8+ month old blog entry from my other blog hosted on blogger. Source URL:
http://technopark02.blogspot.com/2008/03/peoplesoft-fixing-msgget-no-space-left.html
)

When a large number of application server processes are configured in a single PeopleSoft domain, or cumulatively across multiple domains, it is very likely that the PeopleSoft application server domain boot process will fail with errors like:

Booting server processes ...
exec PSSAMSRV -A -- -C psappsrv.cfg -D CS90SPV -S PSSAMSRV :
        Failed.
113954.ben15!PSSAMSRV.29746.1.0: LIBTUX_CAT:681: ERROR: Failure to create message queue
113954.ben15!PSSAMSRV.29746.1.0: LIBTUX_CAT:248: ERROR: System init function failed, Uunixerr = : 
                   msgget: No space left on device
113954.ben15!tmboot.29708.1.-2: CMDTUX_CAT:825: ERROR: Process PSSAMSRV at ben15 failed with /T 
                   tperrno (TPEOS - operating system error)

In this particular example, the PeopleSoft Enterprise is running on a Solaris 10 system. Fortunately the error message is very clear in this case: the failure is related to the message queues. During the domain boot-up process, there is a call to msgget() to create a message queue. If the call to msgget() succeeds, it returns a non-negative integer that serves as the identifier for the newly created message queue. In the case of a failure, it returns -1 and sets the error number to EACCES, EEXIST, ENOENT or ENOSPC, depending on the underlying reason.

From the above error messages it is clear that msgget() failed with errno set to ENOSPC (No space left on device). The man page of msgget(2) has the following explanation for the ENOSPC error code on Solaris:

ERRORS
     The msgget() function will fail if:
     ...
     ...
     ENOSPC    A message queue identifier is to  be  created  but
               the  system-imposed limit on the maximum number of
               allowed  message  queue  identifiers  system  wide
               would be exceeded. See NOTES.

NOTES
     ...
     ...

     The system-imposed limit on  the  number  of  message  queue
     identifiers  is  maintained on a per-project basis using the
     project.max-msg-ids resource control.

That is a strong enough clue to suspect the configured limit on the number of message queue identifiers.

Prior to the release of Solaris 10, the /etc/system System V IPC tunable, msgsys:msginfo_msgmni, was used to control the maximum number of message queues that can be created. The default value on pre-Solaris 10 systems is 50.

With the release of Solaris 10, the majority of the System V IPC tunables were obsoleted, and equivalent resource controls were created for the remaining tunables to reduce the administrative overhead. On Solaris 10 and later versions, System V IPC can be tuned on a per-project basis using the newly introduced resource controls.

On any Solaris 10 system, the resource control, project.max-msg-ids, replaced the old /etc/system tunable, msginfo_msgmni. And the default value has been raised to 128.

Now back to the failure in the PeopleSoft environment. Let's first check the current value configured for project.max-msg-ids.

  • Get the project ID.
     % id -p
    uid=222227(psft) gid=2294(dba) projid=3(default)
  • Examine the project.max-msg-ids resource control for the project with ID 3, using the prctl utility.
     % prctl -n project.max-msg-ids -i project 3
    project: 3: default
    NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
    project.max-msg-ids
            privileged        128       -   deny                                 -
            system          16.8M     max   deny                                 -

Alternatively, run the command ipcs -q to check the number of active message queues. Note that the project with ID 3 is configured to create a maximum of 128 (the default) message queues. When this failure shows up, the number of active message queues in the ipcs -q output will be at or very near the configured value of project.max-msg-ids, as the comparison sketched below illustrates.
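
One quick way to make that comparison is to count the message queue entries (lines beginning with 'q') in the ipcs output and set that number against the privileged value reported by prctl; project ID 3 is simply the ID from this example:

 % ipcs -q | grep -c "^q"
 % prctl -n project.max-msg-ids -i project 3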

Since it appears the configured PeopleSoft domain(s) need more than 128 message queues in order to bring up all the application server processes that constitute the PeopleSoft Enterprise, the solution is to increase the value of the resource control project.max-msg-ids beyond 128. For the sake of simplicity, let's increase it to 256 (twice the default value). Again, the prctl utility can be used to set the new value for the resource control.

  • Assume the privileges of the 'root' user
     % su
    Password:
  • Increase the maximum value for the message queue identifiers to 256 using the prctl utility.
     # prctl -n project.max-msg-ids -r -v 256 -i project 3
  • Verify the new maximum value for the message queue identifiers
     # prctl -n project.max-msg-ids -i project 3
    project: 3: default
    NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
    project.max-msg-ids
            privileged        256       -   deny                                 -
            system          16.8M     max   deny                                 -

With this change, the PeopleSoft Enterprise should boot up without running into the Failure to create message queue .. msgget: No space left on device errors.

Before we conclude, note that the above change is not persistent across operating system reboots. To make it persistent, create a new project using the projadd command, along the lines of the sketch below; the man page for projadd(1M) has another example showing the creation of a project.
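
A minimal sketch, using a made-up project name and ID for the psft user from this example; check the projadd(1M) and user_attr(4) man pages for the exact options on your release:

 # projadd -p 100 -c 'PeopleSoft appserver' -U psft -K 'project.max-msg-ids=(privileged,256,deny)' PSFT
 # usermod -K project=PSFT psft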

Friday Nov 21, 2008

Oracle on Solaris 10 : Fixing the 'ORA-27102: out of memory' Error

(Crossposting the 2+ year old blog entry from my other blog hosted on blogger. Source URL:
http://technopark02.blogspot.com/2006/09/solaris-10oracle-fixing-ora-27102-out.html)

Symptom:

As part of a database tuning effort you increase the SGA/PGA sizes, and Oracle greets you with an ORA-27102: out of memory error message, even though the system has enough free memory to serve the needs of Oracle.

SQL> startup
ORA-27102: out of memory
SVR4 Error: 22: Invalid argument

Diagnosis
$ oerr ORA 27102
27102, 00000, "out of memory"
// *Cause: Out of memory
// *Action: Consult the trace file for details

Not so helpful. Let's look at the alert log for some clues.

% tail -2 alert.log
WARNING: EINVAL creating segment of size 0x000000028a006000
fix shm parameters in /etc/system or equivalent

Oracle is trying to create a ~10G shared memory segment (the size depends on the SGA/PGA settings), but the operating system (Solaris in this example) responded with an invalid argument (EINVAL) error message. There is a little hint about setting shm parameters in /etc/system.
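
To put a number on the request: the segment size in the alert log is a hexadecimal byte count, and expanding it in a shell that understands hex arithmetic (bash or ksh93, for instance) shows that Oracle is asking for a shade over 10GB, more than the roughly 8GB limit uncovered below.

% echo $((0x000000028a006000))
10905214976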

Prior to Solaris 10, the shmsys:shminfo_shmmax parameter had to be set in /etc/system to the maximum size of a shared memory segment that can be created. 8M is the default value on Solaris 9 and prior versions, whereas one quarter of the physical memory is the default on Solaris 10 and later. On a Solaris 10 (or later) system, the current limit can be verified as shown below:

% prtconf | grep Mem
Memory size: 32760 Megabytes

% id -p
uid=59008(oracle) gid=10001(dba) projid=3(default)

% prctl -n project.max-shm-memory -i project 3
project: 3: default
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
project.max-shm-memory
        privileged      7.84GB      -   deny                                 -
        system          16.0EB    max   deny                                 -

Now it is clear that the system is using the default value (one quarter of the 32G of physical memory, reported as 7.84GB) in this scenario, whereas the application (Oracle) is trying to create a shared memory segment of over 10G. Hence the failure.

So, the solution is to configure the system with a value large enough for the shared segment being created, so Oracle succeeds in starting up the database instance.

On Solaris 9 and prior releases, it can be done by adding the following line to /etc/system, followed by a reboot for the system to pick up the new value.

set shminfo_shmmax = 0x000000028a006000

However, the shminfo_shmmax parameter was obsoleted with the release of Solaris 10, and Sun doesn't recommend setting this parameter in /etc/system even though it still works as expected.

On Solaris 10 and later, this value can be changed dynamically on a per-project basis with the help of the resource control facilities, as shown below:

% prctl -n project.max-shm-memory -r -v 10G -i project 3

% prctl -n project.max-shm-memory -i project 3
project: 3: default
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
project.max-shm-memory
        privileged      10.0GB      -   deny                                 -
        system          16.0EB    max   deny                                 -

Note that changes made with the prctl command on a running system are temporary, and will be lost when the system is rebooted. To make the changes permanent, create a project with the projadd command and associate it with the user account as shown below:

% projadd -p 3  -c 'eBS benchmark' -U oracle -G dba  -K 'project.max-shm-memory=(privileged,10G,deny)' OASB
% usermod -K project=OASB oracle

Finally, make sure the project was created, using the projects -l or cat /etc/project commands.

% projects -l
...
...
OASB
        projid : 3
        comment: "eBS benchmark"
        users  : oracle
        groups : dba
        attribs: project.max-shm-memory=(privileged,10737418240,deny)

% cat /etc/project
...
...
OASB:3:eBS benchmark:oracle:dba:project.max-shm-memory=(privileged,10737418240,deny)

With these changes, Oracle would start the database up normally.

SQL> startup
ORACLE instance started.

Total System Global Area 1.0905E+10 bytes
Fixed Size                  1316080 bytes
Variable Size            4429966096 bytes
Database Buffers         6442450944 bytes
Redo Buffers               31457280 bytes
Database mounted.
Database opened.

Related information:

  1. What's New in Solaris System Tuning in the Solaris 10 Release?
  2. Resource Controls (overview)
  3. System Setup Recommendations for Solaris 8 and Solaris 9
  4. Man page of prctl(1)
  5. Man page of projadd


Addendum : Oracle RAC settings

Anonymous Bob suggested the following settings for Oracle RAC in the form of a comment for the benefit of others who run into similar issue(s) when running Oracle RAC. I'm pasting the comment as is (Disclaimer: I have not verified these settings):

Thanks for a great explanation, I would like to add one comment that will help those with an Oracle RAC installation. Modifying the default project covers oracle processes great and is all that is needed for a single instance DB. In RAC however, the CRS process starts the DB and it is a root owned process and root does not use the default project. To fix ORA-27102 issue for RAC I added the following lines to an init script that runs before the init.crs script fires.

# Recommended Oracle RAC system params
ndd -set /dev/udp udp_xmit_hiwat 65536
ndd -set /dev/udp udp_recv_hiwat 65536

# For root processes like crsd
prctl -n project.max-shm-memory -r -v 8G -i project system
prctl -n project.max-shm-ids -r -v 512 -i project system

# For oracle processes like sqlplus
prctl -n project.max-shm-memory -r -v 8G -i project default
prctl -n project.max-shm-ids -r -v 512 -i project default

So simple yet it took me a week working with Oracle and SUN to come up with that answer...Hope that helps someone out.

Bob
# posted by Blogger Bob : 6:48 AM, April 25, 2008

Monday Apr 23, 2007

Solaris: (Undocumented) Thread Environment Variables

The new thread library (initially called T2) on Solaris 8 Update 7 and later versions of Solaris provides a 1:1 threading model for better performance and scalability. Just like other Solaris libraries, the new thread library is binary compatible with the old library, and applications linked with the old library continue to work as they are, without any changes to the application or any need to re-compile the code.

The primary difference is that the new thread library uses a 1:1 thread model, whereas the old library implements a two-level model in which user-level threads are multiplexed over possibly fewer lightweight processes. In the 1:1 (or 1x1) model, each user-level (application) thread has a corresponding kernel thread; a multi-threaded application with n threads will have n kernel threads. Each user-level thread has a lightweight process (LWP) connecting it to the kernel thread. Because there are more LWPs, resource consumption is a little higher with the 1x1 model than with the MxN model; but because there is less overhead in multiplexing and scheduling threads over LWPs, the 1x1 model performs much better than the old MxN model.

Tuning an application with thread environment variables

Within the new thread library framework, a bunch of tunables have been provided to tweak the performance of an application. All of these tunables have default values, and often the defaults are good enough. For some unknown reason, these tunables are not documented anywhere except in the source code. You can have a look at the source code of the following files at the OpenSolaris web site.

  • thr.c <- check the routine etest()
  • synch.c <- read the comments and code between lines: 76 and 97

Here's a brief description of these environment variables along with their default values; a usage sketch follows the list:
  1. LIBTHREAD_QUEUE_SPIN

    • Controls the spinning for queue locks
    • Default value: 1000

  2. LIBTHREAD_ADAPTIVE_SPIN

    • Specifies the number of iterations (spin count) for adaptive mutex locking before giving up and going to sleep
    • Default value: 1000

  3. LIBTHREAD_RELEASE_SPIN

    • Spin LIBTHREAD_RELEASE_SPIN times to see if a spinner grabs the lock; if so, don’t bother to wake up a waiter
    • Default value: LIBTHREAD_ADAPTIVE_SPIN/2, i.e., 500. However, it can be set explicitly to override the default value

  4. LIBTHREAD_MAX_SPINNERS

    • Limits the number of simultaneous spinners attempting to do adaptive locking
    • Default value: 100

  5. LIBTHREAD_MUTEX_HANDOFF

    • Do direct mutex handoff (no adaptive locking)
    • Default value: 0

  6. LIBTHREAD_QUEUE_FIFO

    • Specifies the frequency of FIFO queueing vs LIFO queueing (range is 0..8)
    • Default value: 4
    • LIFO queue ordering is unfair and can lead to starvation, but it gives better performance for heavily contended locks

      • 0 - every 256th time (almost always LIFO)
      • 1 - every 128th time
      • 2 - every 64th time
      • 3 - every 32nd time
      • 4 - every 16th time (default, mostly LIFO)
      • 5 - every 8th time
      • 6 - every 4th time
      • 7 - every 2nd time
      • 8 - every time (never LIFO, always FIFO)

  7. LIBTHREAD_QUEUE_DUMP

    • Causes a dump of user-level sleep queue statistics onto stderr (file descriptor 2) on exit
    • Default value: 0

  8. LIBTHREAD_STACK_CACHE

    • Specifies the number of cached stacks the library keeps for re-use when more threads are created
    • Default value: 10

  9. LIBTHREAD_COND_WAIT_DEFER

    • Little bit of history:

      The old thread library (Solaris 8's default) forces the thread performing cond_wait() to block all signals until it reacquires the mutex.

      The new thread library, the default on Solaris 9 and later versions, behaves in exactly the opposite way. The state of the mutex in the signal handler is undefined, and the thread does not block any more signals than it was already blocking on entry to cond_wait().

      However, to accommodate applications that rely on the old behavior, the new thread library implements the old behavior as well, and it can be controlled with the LIBTHREAD_COND_WAIT_DEFER environment variable. To get the old behavior, this variable has to be set to 1.


  10. LIBTHREAD_ERROR_DETECTION

    • Setting it to 1 issues a warning message about illegal locking operations; setting it to 2 issues the warning message and forces a core dump
    • Default value: 0
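
Since these are plain environment variables, experimenting with them is just a matter of setting a variable before starting the multi-threaded application and measuring the difference. A minimal sketch in a Bourne-like shell; the application name and the chosen values are arbitrary placeholders:

% LIBTHREAD_ADAPTIVE_SPIN=2000 LIBTHREAD_QUEUE_DUMP=1 ./mtapp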

Reference:
Roger Faulkner's New/Improved Threads Library presentation

(Original blog post is at:
http://technopark02.blogspot.com/2005/06/solaris-undocumented-thread.html
)

Sunday Apr 15, 2007

C/C++: Printing Stack Trace with printstack() on Solaris

On Solaris 9 and later, libc provides a useful function called printstack, to print a symbolic stack trace to the specified file descriptor. This is useful for reporting errors from an application at run-time.

If the stack trace appears corrupted, or if the stack cannot be read, printstack() returns -1.

Here is a programmatic example:

% more stacktrace.c

#include <stdio.h>
#include <ucontext.h>

int callee (int file)
{
    printstack(file);
    return (0);
}

int caller()
{
    int a;
    a = callee (fileno(stdout));
    return (a);
}

int main()
{
    caller();
    return (0);
}

% cc -V
cc: Sun C 5.8 Patch 121016-04 2006/10/18 <- Sun Studio 11 C compiler
usage: cc [ options] files.  Use 'cc -flags' for details

% cc -o stacktrace stacktrace.c

% ./stacktrace
/tmp/stacktrace:callee+0x18
/tmp/stacktrace:caller+0x22
/tmp/stacktrace:main+0x14
/tmp/stacktrace:0x6d2

The printstack() function uses dladdr1() to obtain symbolic symbol names. As a result, only global symbols are reported as symbol names by printstack().

The stack trace from a C++ program will have all the symbols in their mangled form. So, for now, programmers may need to write their own wrapper functions to print the stack trace in demangled form.

% CC -V
CC: Sun C++ 5.8 Patch 121018-07 2006/11/01 <- Sun Studio 11 C++ compiler

% CC -o stacktrace stacktrace.c

% ./stacktrace
/tmp/stacktrace:__1cGcallee6Fi_i_+0x18
/tmp/stacktrace:__1cGcaller6F_i_+0x22
/tmp/stacktrace:main+0x14
/tmp/stacktrace:0x91a

There is an RFE (Request For Enhancement) in place against Solaris' libc, 6182270 printstack should provide demangled names while using with c++, to print the stack trace in demangled form when printstack() is called from a C++ program. The fix will be released as a libc patch for Solaris 9 & 10 as soon as it gets enough attention.

% /usr/ccs/bin/elfdump libc.so.1  | grep printstack
     [677]  0x00069f70 0x0000004c  FUNC WEAK  D    9 .text          printstack
    [1425]  0x00069f70 0x0000004c  FUNC GLOB  D   36 .text          _printstack
     ...

Since the object code is automatically linked with libc during the creation of an executable or a dynamic library, the programmer need not specify -lc on the compile line.

Suggested Reading:
Man page of walkcontext / printstack

(Original blog post is at:
http://technopark02.blogspot.com/2005/05/cc-printing-stack-trace-with.html)

Monday Mar 12, 2007

Solaris: How To Disable Out Of The Box (OOB) Large Page Support?

Starting with the release of Solaris 10 1/06 (aka Solaris 10 Update 1), the large page OOB feature turns on MPSS (Multiple Page Size Support) automatically for applications' data (heap) and text (libraries).

One obvious advantage of the large page OOB feature is that it improves the performance of user-land applications by reducing the CPU cycles wasted in serving iTLB and dTLB misses. For example, if the heap size of a process is 256M, on a Niagara (UltraSPARC-T1) box it will be mapped onto a single 256M page. On a system that doesn't support large pages, it will be mapped onto 32,768 8K pages. Now imagine having all the words of a story on a single large page versus having the words spread across 32,768 small pages. Which one do you prefer?

However, the large page OOB feature may have a negative impact on some applications. For example, an application may crash due to wrong assumptions about the page size, or virtual memory consumption may increase due to the way the data and libraries are mapped onto larger pages.

Fortunately there are a bunch of /etc/system tunables to enable/disable large page OOB support on Solaris.

/etc/system tunables to disable large page OOB feature

  • set exec_lpg_disable = 1

    This parameter prevents large pages from being used when the kernel is allocating memory for processes being executed. These constitute the memory needed for a process's text/data/bss.

  • set use_brk_lpg = 0

    This parameter prevents large pages from being used for heap. To enable large pages for heap, set the value of this parameter to 1 or remove this parameter from /etc/system completely.

    Note:
    brk() is the kernel routine that is called whenever a user level application invokes malloc().

  • set use_stk_lpg = 0

    This parameter disables the large pages for stack. Set it to 1 to retain the default functionality.

  • set use_zmap_lpg = 0

    This variable controls the size of anonymous (anon) pages.

  • set use_text_pgsz4m = 0

    This tunable disables the default use of 4M text pages on UltraSPARC-III/III+/IV/IV+/T1 platforms.

  • set use_text_pgsz64k = 0

    This tunable disables the default use of 64K text pages on UltraSPARC-T1 (Niagara) platform.

  • set use_initdata_pgsz64k = 0

    This tunable disables the default use of 64K data pages on UltraSPARC-T1 (Niagara) platform.


Turning off large page OOB support for heap/stack/anon pages on-the-fly

Setting /etc/system parameters requires the system to be rebooted for the change to take effect. However, it is possible to set the desired page size for heap/stack/anon pages dynamically, as shown below. Note that the system goes back to the default behavior when it is rebooted. Depending on how long you need large page support turned off, use mdb or the /etc/system parameters at your discretion.

To turn off large page support for heap, stack and anon pages dynamically, set the following under mdb -kw:

  • use_brk_lpg/W 0 (heap)
  • use_stk_lpg/W 0 (stack)
  • use_zmap_lpg/W 0 (anon)
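
Put together, a live session would look something like this. The old values echoed back by mdb will vary from system to system and are shown here only to illustrate the /W write syntax:

# mdb -kw
> use_brk_lpg/W 0
use_brk_lpg:    0x1     =       0x0
> use_stk_lpg/W 0
use_stk_lpg:    0x1     =       0x0
> use_zmap_lpg/W 0
use_zmap_lpg:   0x1     =       0x0
> ::quit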

Note:
Java sets its own page size with the memcntl() interface, so the /etc/system changes won't impact Java at all. Consider using the JVM option -XX:LargePageSizeInBytes=pagesize[K|M] to set the desired page size for Java process mappings.
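
For example, a Java application can be pointed at 256M pages for its heap as shown below; the class name and heap size are placeholders:

% java -XX:LargePageSizeInBytes=256M -Xmx2g SomeApp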

How to check whether disabling large page support is really helping?

Compare the outputs of the following (along with application-specific data) before and after the changes:

  • vmstat 2 50 - look under free and id columns
  • trapstat -T 5 5 - check %time column
  • mdb -k and then ::memstat

How to set the maximum large page size?

Run pagesize -a to get the list of supported page sizes on your system. Then set the page size of your choice as shown below.

% mdb -kw
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba random fcp fctl nca
lofs ssd logindmux ptm cpc sppp crypto nfs ipc ]
> auto_lpg_maxszc/W <hex_value>
where:
hex_value = { 0x0 for 8K,
0x1 for 64K,
0x2 for 512K,
0x3 for 4M,
0x4 for 32M and
0x5 for 256M }

How to check the maximum page size in use?

Here is an example from a Niagara box (T2000):

% pagesize -a
8192
65536
4194304
268435456
% mdb -kw
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba random fcp fctl nca
lofs ssd logindmux ptm cpc sppp crypto nfs ipc ]
> auto_lpg_maxszc/X
auto_lpg_maxszc:
auto_lpg_maxszc:5
> ::quit

See Also:
6287398 vm 1.5 dumps core with -d64

Acknowledgements:
Sang-Suan Sam Gam

(Original blog post is at:
http://technopark02.blogspot.com/2006/11/solaris-disabling-out-of-box-oob-large.html
)

Thursday Dec 07, 2006

Solaris: Workaround to stdio's 255 open file descriptors limitation

A quick web search with key words Solaris stdio open file descriptors results in numerous references to stdio's limitation of 255 open file descriptors on Solaris.

#1085341: 32-bit stdio routines should support file descriptors >255, a 14-year-old RFE, explains the problem, and the bug report links to a handful of other bugs which are somehow related to stdio's open file descriptor limitation.

Now the good news: the wait is finally over. Sun Microsystems finally made a fix/workaround available to the community in the form of a new library called extendedFILE. If you are running Solaris Express (SX) 06/06 or later builds, your system already has the workaround; you just need to enable it to get around the 255 open file descriptor problem with stdio's API. [Update: 02/05/07] On the Solaris 10 side, for some reason the workaround couldn't make it into the Update 3 release (Solaris 10 11/06); it will instead be part of the next major release, Solaris 10 Update 4. There are some plans to backport this workaround to Solaris 9 and 10; however, there is no clear timeline for the backport at the moment.

The workaround does not require any source code changes or re-compilation of the objects. You just need to increase the file descriptor limit using limit or ulimit commands; and pre-load /usr/lib/extendedFILE.so.1 before running the application.
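
For those using a Bourne-like shell rather than csh, the equivalent of the limit/setenv steps in the example session below would be along these lines (assuming the hard limit permits raising the descriptor count that far):

$ ulimit -n 65536
$ LD_PRELOAD_32=/usr/lib/extendedFILE.so.1 ./files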

However, applications fiddling with the _file field of the FILE structure may not work. This is because when the extendedFILE library is pre-loaded, descriptors > 255 are stored in an auxiliary location and a fake descriptor is stored in the FILE structure's _file field. In fact, accessing _file directly was long discouraged; and to discourage non-standard practices even further, _file has been renamed to _magic starting with SX 06/06. So applications which access _file directly, rather than through the fileno() function, may encounter compilation errors starting with Solaris 10 Update 4. This step is necessary to ensure that the source code is in a clean state, so the resulting object code is not vulnerable to data corruption at run-time.

The following example shows the failure and the steps to work around the issue. Note that with the extendedFILE library pre-loaded, the process can open up to 65532 files, excluding stdin, stdout and stderr.

* Test case (a simple C program that tries to open 65536 files):
% cat files.c
#include <stdio.h>
#include <stdlib.h>     /* for exit() */

#define NoOfFILES 65536

int main()
{
        char filename[10];
        FILE *fds[NoOfFILES];
        int i;

        /* keep opening files until fopen() fails or NoOfFILES is reached */
        for (i = 0; i < NoOfFILES; ++i)
        {
                sprintf (filename, "%d.log", i);
                fds[i] = fopen(filename, "w");

                if (fds[i] == NULL)
                {
                        printf("\n** Number of open files = %d. fopen() failed with error:  ", i);
                        perror("");
                        exit(1);
                }
                else
                {
                        fprintf (fds[i], "some string");
                }

        }
        return (0);
}

% cc -o files files.c
* Re-producing the failure:
 % limit
 cputime         unlimited
 filesize        unlimited
 datasize        unlimited
 stacksize       8192 kbytes
 coredumpsize    unlimited
 descriptors     256
 memorysize      unlimited

 % uname -a
 SunOS sunfire4 5.11 snv_42 sun4u sparc SUNW,Sun-Fire-280R

 % ./files

 ** Number of open files = 253. fopen() failed with error: Too many open files

* Showcasing the Solaris workaround:
 % limit descriptors 5000

 % limit
 cputime         unlimited
 filesize        unlimited
 datasize        unlimited
 stacksize       8192 kbytes
 coredumpsize    unlimited
 descriptors     5000
 memorysize      unlimited

 % setenv LD_PRELOAD_32 /usr/lib/extendedFILE.so.1

 % ./files

 ** Number of open files = 4996. fopen() failed with error: Too many open files

 % limit descriptors 65536

 % limit
 cputime         unlimited
 filesize        unlimited
 datasize        unlimited
 stacksize       8192 kbytes
 coredumpsize    unlimited
 descriptors     65536
 memorysize      unlimited

 % ./files

 ** Number of open files = 65532. fopen() failed with error: Too many open files

(Note that descriptors 0, 1 and 2 are used by stdin, stdout and stderr.)

 % ls -l 65531.log
 -rw-rw-r--   1 gmandali ccuser        11 Aug  9 12:35 65531.log

For further information about the extendedFILE library and for the extensions to fopen, fdopen, popen, .. please have a look at the new and updated man pages:

extendedFILE
enable_extended_FILE_stdio
fopen
fdopen


Acknowledgements:
Craig Mohrman

(Original blog post is at:
http://technopark02.blogspot.com/2006/08/solaris-workaround-to-stdios-255-open.html
)