Wednesday Jun 12, 2013

Growing the root pool

Some small in-between laptop experiences...  I finally decided to throw away that other OS (I used it so rarely that I regularly had to use the password reset procedure...).  That gave me another 50 GB of valuable laptop disk space - fortunately on the right part of the disk.  So in theory, all I'd have to do is resize the Solaris partition, tell ZFS about it and be happy...  Of course, there are the usual pitfalls.

To avoid confusion, much of this is x86 related.  On normal SPARC servers, you don't have any of the problems for which I describe solutions here...

First of all, you should *not* try to resize the partition that hosts your rpool while Solaris is up and running.  It works, but there are nicer ways to force a shutdown.  (What happens is that fdisk will not only create the new partition, but also write a default label into it, which means that ZFS will no longer find its slice, which will make Solaris very unresponsive...)  The right way to do this is to boot off something else (PXE, USB, DVD, whatever) and then change the partition size.  Once that's done, re-create the slice for the ZFS rpool.  The important part is to use the very same starting cylinder.  The length, naturally, will be larger.  (At least, I had to do that, since the original zpool lived in a slice.)
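For orientation, here's a minimal sketch of that sequence, with my laptop's device names standing in for yours - all of it done from the rescue environment, not the running system:

# fdisk /dev/rdsk/c0t0d0p0
    (enlarge the Solaris2 partition, keeping its original starting point)
# format c0t0d0
    (partition -> re-create slice 0 with the same starting cylinder and the new, larger size, then label)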

After that, it's back to the book:  Boot Solaris and choose one of "zpool set autoexpand=on rpool" or "zpool online -e rpool c0t0d0s0" and there you go - 50 GB more space.
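As a quick sketch of that last step (device name as on my laptop), with zpool list before and after to make the change visible:

# zpool list rpool
# zpool set autoexpand=on rpool
# zpool online -e rpool c0t0d0s0
# zpool list rpool

The second zpool list should show the extra space in the SIZE column.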

Did I forget to mention that I actually did a full backup before all of this?  I must be getting old...

Wednesday Nov 07, 2012

20 Years of Solaris - 25 Years of SPARC!

I don't usually duplicate what can be found elsewhere.  But this is worth an exception.

20 Years of Solaris - Guess who got all those innovation awards!
25 Years of SPARC - And the future has just begun :-)

Check out those pages for some links pointing to the past, and, more interesting, to the future...

There are also some nice videos: 20 Years of Solaris - 25 Years of SPARC

(Come to think of it - I got to be part of all but the first 4 years of Solaris.  I must be getting older...)

Tuesday Apr 17, 2012

Solaris Zones: Virtualization that Speeds up Benchmarks

One of the first questions that typically comes up when I talk to customers about virtualization is the overhead involved.  Now we all know that virtualization with hypervisors comes with an overhead of some sort.  We should also all know that exactly how big that overhead is depends on the type of workload as much as it depends on the hypervisor used.  While there have been attempts to create standard benchmarks for this, quantifying hypervisor overhead is still mostly hidden in the mists of marketing and benchmark uncertainty.  However, what always raises eyebrows is when I come to Solaris Zones (called Containers in Solaris 10) as an alternative to hypervisor virtualization.  Since Zones are, greatly simplified, nothing more than a group of Unix processes contained by a set of rules which are enforced by the Solaris kernel, it is quite evident that there can't be much overhead involved.  Nevertheless, since many people think in hypervisor terms, there is almost always some doubt about this claim of zero overhead.  And as much as I find the explanation with technical details compelling, I also understand that seeing is so much better than believing.  So - look and see:

The Oracle benchmark teams are so convinced of the advantages of Solaris Zones that they actually use them in the configurations for public benchmarking.  Solaris resource management will also work in a non Zones environment, but Zones make it just so much easier to handle, especially with some of the more complex benchmark configurations.  There are numerous benchmark publications available using Solaris Containers, dating back to the days of the T5440.  Some recent examples, all of them world records, are:

The use of Solaris Zones is documented in all of these benchmark publications.

The benchmarking team also published a blog entry detailing how they make use of resource management with Solaris Zones to actually increase application performance.  That almost calls for the term "negative overhead", if it weren't somewhat misleading.

So, if you ever need to substantiate why Solaris Zones have no virtualization overhead, point to these (and probably some more) published benchmarks.

Monday Mar 19, 2012

Setting up a local AI server - easy with Solaris 11

Many things are new in Solaris 11; Automated Installation (AI) is one of them.  If, like me, you've known Jumpstart for the last two centuries or so, you'll have to start from scratch.  Well, almost, as the concepts are similar, and it's not all that difficult.  Just new.

I wanted to have an AI server that I could use for demo purposes, on the train if need be.  That answers the question of hardware requirements: portable.  But let's start at the beginning.

First, you need an OS image, of course.  In the new world of Solaris 11, it is now called a repository.  The original can be downloaded from the Solaris 11 page at Oracle.   What you want is the "Oracle Solaris 11 11/11 Repository Image", which comes in two parts that can be combined using cat.  MD5 checksums for these (and all other downloads from that page) are available closer to the top of the page.

With that, building the repository is quick and simple:

# zfs create -o mountpoint=/export/repo rpool/ai/repo
# zfs create rpool/ai/repo/sol11
# mount -o ro -F hsfs /tmp/sol-11-1111-repo-full.iso /mnt
# rsync -aP /mnt/repo /export/repo/sol11
# umount /mnt
# pkgrepo rebuild -s /export/repo/sol11/repo
# zfs snapshot rpool/ai/repo/sol11@fcs
# pkgrepo info -s  /export/repo/sol11/repo
PUBLISHER PACKAGES STATUS           UPDATED
solaris   4292     online           2012-03-12T20:47:15.378639Z
That's all there is to it.  The snapshot is just to be on the safe side - you never know when one will come in handy.  To use this repository, you could just add it as a file-based publisher:
# pkg set-publisher -g file:///export/repo/sol11/repo solaris
Since I'll want to access this repository over a (virtual) network, I'll now quickly activate the repository service:
# svccfg -s application/pkg/server \
setprop pkg/inst_root=/export/repo/sol11/repo
# svccfg -s application/pkg/server setprop pkg/readonly=true
# svcadm refresh application/pkg/server
# svcadm enable application/pkg/server

That's all you need - now point your browser to http://localhost/ to view your beautiful repository server.  Step 1 is done.  All of this, by the way, is nicely documented in the README file that's contained in the repository image.

Of course, there are already updates to the original release.  You can find them in MOS in the Oracle Solaris 11 Support Repository Updates (SRU) Index.  You can simply add these to your existing repository or create separate repositories for each SRU.  The individual SRUs are self-sufficient and cumulative - SRU4 includes all updates from SRU2 and SRU3.  With ZFS, you can get both: a full repository with all updates and, at the same time, the state of the repository at each individual update:

# mount -o ro -F hsfs /tmp/sol-11-1111-sru4-05-incr-repo.iso /mnt
# pkgrecv -s /mnt/repo -d /export/repo/sol11/repo '*'
# umount /mnt
# pkgrepo rebuild -s /export/repo/sol11/repo
# zfs snapshot rpool/ai/repo/sol11@sru4
# zfs set snapdir=visible rpool/ai/repo/sol11
# svcadm restart svc:/application/pkg/server:default
The normal repository is now updated to SRU4.  Thanks to the ZFS snapshots, there is also a valid repository of Solaris 11 11/11 without the update, located at /export/repo/sol11/.zfs/snapshot/fcs.  If you like, you can create another repository service for each update, running on a separate port.
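Here's a rough sketch of what such an additional service instance could look like, serving the repo directory inside the fcs snapshot on its own port - the instance name and port number are made up for this example:

# svccfg -s application/pkg/server add fcs
# svccfg -s application/pkg/server:fcs addpg pkg application
# svccfg -s application/pkg/server:fcs setprop pkg/port = count: 10081
# svccfg -s application/pkg/server:fcs setprop \
pkg/inst_root = astring: "/export/repo/sol11/.zfs/snapshot/fcs/repo"
# svccfg -s application/pkg/server:fcs setprop pkg/readonly = boolean: true
# svcadm refresh application/pkg/server:fcs
# svcadm enable application/pkg/server:fcs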

But now let's continue with the AI server.  Just a little bit of reading in the documentation makes it clear that we will need to run a DHCP server for this.  Since I already have one active (for my SunRay installation) and since it's a good idea to keep these kinds of services separate anyway, I decided to create this in a Zone.  So, let's create one first:

# zfs create -o mountpoint=/export/install rpool/ai/install
# zfs create -o mountpoint=/zones rpool/zones
# zonecfg -z ai-server
zonecfg:ai-server> create
create: Using system default template 'SYSdefault'
zonecfg:ai-server> set zonepath=/zones/ai-server
zonecfg:ai-server> add dataset
zonecfg:ai-server:dataset> set name=rpool/ai/install
zonecfg:ai-server:dataset> set alias=install
zonecfg:ai-server:dataset> end
zonecfg:ai-server> commit
zonecfg:ai-server> exit
# zoneadm -z ai-server install
# zoneadm -z ai-server boot ; zlogin -C ai-server
Give it a hostname and IP address at first boot, and the Zone is ready.  As its publisher for Solaris packages, it will be bound to the system publisher of the Global Zone.  The /export/install filesystem, of course, is intended to be used by the AI server.  Let's configure it now:
# zlogin ai-server
root@ai-server:~# pkg install install/installadm
root@ai-server:~# installadm create-service -n x86-fcs -a i386 \
-s pkg://solaris/install-image/solaris-auto-install@5.11,5.11-0.175.0.0.0.2.1482 \
-d /export/install/fcs -i 192.168.2.20 -c 3

With that, the core AI server is already done.  What happened here?  First, I installed the AI server software.  IPS makes that nice and easy.  If necessary, it'll also pull in the required DHCP server and anything else that might be missing.  Watch out for that DHCP server software.  In Solaris 11, there are two different versions.  There's the one you might know from Solaris 10 and earlier, and then there's a new one from ISC.  The latter is the one we need for AI.  The SMF service names of both are very similar.  The "old" one is "svc:/network/dhcp-server:default".  The ISC server comes with several SMF services; we at least need "svc:/network/dhcp/server:ipv4".
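A quick way to check which of the two is present and enabled on your system:

# svcs -a | grep dhcp

The legacy server shows up as svc:/network/dhcp-server:default, while the ISC instance we want, svc:/network/dhcp/server:ipv4, should be online once installadm has done its work.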

The command "installadm create-service" creates the installation-service. It's called "x86-fcs", serves the "i386" architecture and gets its boot image from the repository of the system publisher, using version 5.11,5.11-0.175.0.0.0.2.1482, which is Solaris 11 11/11.  (The option "-a i386" in this example is optional, since the installserver itself runs on a x86 machine.) The boot-environment for clients is created in /export/install/fcs and the DHCP-server is configured for 3 IP-addresses starting at 192.168.2.20.  This configuration is stored in a very human readable form in /etc/inet/dhcpd4.conf.  An AI-service for SPARC systems could be created in the very same way, using "-a sparc" as the architecture option.

Now we would be ready to register and install the first client.  It would be installed with the default "solaris-large-server" group package, using the publisher "http://pkg.oracle.com/solaris/release", and would query its configuration interactively at first boot.  This makes it very clear that an AI server is really only a boot server.  The true source of the packages to install can be different.  Since I don't like these defaults for my demo setup, I did some extra config work for my clients.

The configuration of a client is controlled by manifests and profiles.  The manifest controls which packages are installed and how the filesystems are laid out.  In that, it's very much like the old "rules.ok" file in Jumpstart.  Profiles contain additional configuration like root passwords, primary user accounts, IP addresses, keyboard layout etc.  Hence, profiles are very similar to the old sysidcfg file.

The easiest way to get your hands on a manifest is to ask the AI server we just created for its default one, modify that to our liking and then give it back to the install server to use:

root@ai-server:~# mkdir -p /export/install/configs/manifests
root@ai-server:~# cd /export/install/configs/manifests
root@ai-server:~# installadm export -n x86-fcs -m orig_default \
-o orig_default.xml
root@ai-server:~# cp orig_default.xml s11-fcs.small.local.xml
root@ai-server:~# vi s11-fcs.small.local.xml
root@ai-server:~# more s11-fcs.small.local.xml
<!DOCTYPE auto_install SYSTEM "file:///usr/share/install/ai.dtd.1">
<auto_install>
  <ai_instance name="S11 Small fcs local">
    <target>
      <logical>
        <zpool name="rpool" is_root="true">
          <filesystem name="export" mountpoint="/export"/>
          <filesystem name="export/home"/>
          <be name="solaris"/>
        </zpool>
      </logical>
    </target>
    <software type="IPS">
      <destination>
        <image>
          <!-- Specify locales to install -->
          <facet set="false">facet.locale.*</facet>
          <facet set="true">facet.locale.de</facet>
          <facet set="true">facet.locale.de_DE</facet>
          <facet set="true">facet.locale.en</facet>
          <facet set="true">facet.locale.en_US</facet>
        </image>
      </destination>
      <source>
        <publisher name="solaris">
          <origin name="http://192.168.2.12/"/>
        </publisher>
      </source>
      <!--
        By default the latest build available, in the specified IPS
        repository, is installed.  If another build is required, the
        build number has to be appended to the 'entire' package in the
        following form:

            <name>pkg:/entire@0.5.11-0.build#</name>
      -->
      <software_data action="install">
        <name>pkg:/entire@0.5.11,5.11-0.175.0.0.0.2.0</name>
        <name>pkg:/group/system/solaris-small-server</name>
      </software_data>
    </software>
  </ai_instance>
</auto_install>

root@ai-server:~# installadm create-manifest -n x86-fcs -d \
-f ./s11-fcs.small.local.xml 
root@ai-server:~# installadm list -m -n x86-fcs
Manifest             Status    Criteria 
--------             ------    -------- 
S11 Small fcs local  Default   None
orig_default         Inactive  None

The major points in this new manifest are:

  • Install "solaris-small-server"
  • Install a few locales less than the default.  I'm not that fluent in French or Japanese...
  • Use my own package service as publisher, running on IP address 192.168.2.12
  • Install the initial release of Solaris 11:  pkg:/entire@0.5.11,5.11-0.175.0.0.0.2.0

Using a similar approach, I'll create a default profile interactively and use it as a template for a few customized building blocks, each defining a part of the overall system configuration.  The modular approach makes it easy to configure numerous clients later on:

root@ai-server:~# mkdir -p /export/install/configs/profiles
root@ai-server:~# cd /export/install/configs/profiles
root@ai-server:~# sysconfig create-profile -o default.xml
root@ai-server:~# cp default.xml general.xml; cp default.xml mars.xml
root@ai-server:~# cp default.xml user.xml
root@ai-server:~# vi general.xml mars.xml user.xml
root@ai-server:~# more general.xml mars.xml user.xml
::::::::::::::
general.xml
::::::::::::::
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="profile" name="sysconfig">
  <service version="1" type="service" name="system/timezone">
    <instance enabled="true" name="default">
      <property_group type="application" name="timezone">
        <propval type="astring" name="localtime" value="Europe/Berlin"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="system/environment">
    <instance enabled="true" name="init">
      <property_group type="application" name="environment">
        <propval type="astring" name="LANG" value="C"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="system/keymap">
    <instance enabled="true" name="default">
      <property_group type="system" name="keymap">
        <propval type="astring" name="layout" value="US-English"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="system/console-login">
    <instance enabled="true" name="default">
      <property_group type="application" name="ttymon">
        <propval type="astring" name="terminal_type" value="vt100"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="network/physical">
    <instance enabled="true" name="default">
      <property_group type="application" name="netcfg">
        <propval type="astring" name="active_ncp" value="DefaultFixed"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="system/name-service/switch">
    <property_group type="application" name="config">
      <propval type="astring" name="default" value="files"/>
      <propval type="astring" name="host" value="files dns"/>
      <propval type="astring" name="printer" value="user files"/>
    </property_group>
    <instance enabled="true" name="default"/>
  </service>
  <service version="1" type="service" name="system/name-service/cache">
    <instance enabled="true" name="default"/>
  </service>
  <service version="1" type="service" name="network/dns/client">
    <property_group type="application" name="config">
      <property type="net_address" name="nameserver">
        <net_address_list>
          <value_node value="192.168.2.1"/>
        </net_address_list>
      </property>
    </property_group>
    <instance enabled="true" name="default"/>
  </service>
</service_bundle>
::::::::::::::
mars.xml
::::::::::::::
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="profile" name="sysconfig">
  <service version="1" type="service" name="network/install">
    <instance enabled="true" name="default">
      <property_group type="application" name="install_ipv4_interface">
        <propval type="astring" name="address_type" value="static"/>
        <propval type="net_address_v4" name="static_address" 
                 value="192.168.2.100/24"/>
        <propval type="astring" name="name" value="net0/v4"/>
        <propval type="net_address_v4" name="default_route" 
                 value="192.168.2.1"/>
      </property_group>
      <property_group type="application" name="install_ipv6_interface">
        <propval type="astring" name="stateful" value="yes"/>
        <propval type="astring" name="stateless" value="yes"/>
        <propval type="astring" name="address_type" value="addrconf"/>
        <propval type="astring" name="name" value="net0/v6"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="system/identity">
    <instance enabled="true" name="node">
      <property_group type="application" name="config">
        <propval type="astring" name="nodename" value="mars"/>
      </property_group>
    </instance>
  </service>
</service_bundle>
::::::::::::::
user.xml
::::::::::::::
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="profile" name="sysconfig">
  <service version="1" type="service" name="system/config-user">
    <instance enabled="true" name="default">
      <property_group type="application" name="root_account">
        <propval type="astring" name="login" value="root"/>
        <propval type="astring" name="password" 
                 value="noIWillNotTellYouMyPasswordNotEvenEncrypted"/>
        <propval type="astring" name="type" value="role"/>
      </property_group>
      <property_group type="application" name="user_account">
        <propval type="astring" name="login" value="stefan"/>
        <propval type="astring" name="password" 
                 value="noIWillNotTellYouMyPasswordNotEvenEncrypted"/>
        <propval type="astring" name="type" value="normal"/>
        <propval type="astring" name="description" value="Stefan Hinker"/>
        <propval type="count" name="uid" value="12345"/>
        <propval type="count" name="gid" value="10"/>
        <propval type="astring" name="shell" value="/usr/bin/bash"/>
        <propval type="astring" name="roles" value="root"/>
        <propval type="astring" name="profiles" value="System Administrator"/>
        <propval type="astring" name="sudoers" value="ALL=(ALL) ALL"/>
      </property_group>
    </instance>
  </service>
</service_bundle>
root@ai-server:~# installadm create-profile -n x86-fcs -f general.xml
root@ai-server:~# installadm create-profile -n x86-fcs -f user.xml
root@ai-server:~# installadm create-profile -n x86-fcs -f mars.xml \
-c ipv4=192.168.2.100
root@ai-server:~# installadm list -p

Service Name  Profile     
------------  -------     
x86-fcs       general.xml
              mars.xml
              user.xml

root@ai-server:~# installadm list -n x86-fcs -p

Profile      Criteria 
-------      -------- 
general.xml  None
mars.xml     ipv4 = 192.168.2.100
user.xml     None

Here's the idea behind these files:

  • "general.xml" contains settings valid for all my clients.  Stuff like DNS servers, for example, which in my case will always be the same.
  • "user.xml" only contains user definitions.  That is, a root password and a primary user.
    Both of these profiles will be valid for all clients (for now).
  • "mars.xml" defines network settings for an individual client.  This profile is associated with an IP-Address.  For this to work, I'll have to tweak the DHCP-settings in the next step:
root@ai-server:~# installadm create-client -e 08:00:27:AA:3D:B1 -n x86-fcs
root@ai-server:~# vi /etc/inet/dhcpd4.conf
root@ai-server:~# tail -5 /etc/inet/dhcpd4.conf
host 080027AA3DB1 {
  hardware ethernet 08:00:27:AA:3D:B1;
  fixed-address 192.168.2.100;
  filename "01080027AA3DB1";
}

This completes the client preparations.  I manually added the IP address for mars to /etc/inet/dhcpd4.conf.  This is needed for the "mars.xml" profile.  Disabling replies to arbitrary DHCP requests keeps this DHCP server quiet for unknown clients, making my life in a shared environment a lot more peaceful ;-)
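One way to do that in /etc/inet/dhcpd4.conf is to restrict the subnet declaration to known hosts - a sketch, assuming the 192.168.2.0/24 subnet used throughout this setup:

subnet 192.168.2.0 netmask 255.255.255.0 {
  deny unknown-clients;
  option broadcast-address 192.168.2.255;
  option routers 192.168.2.1;
}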

Note: The above example shows the configuration for x86 clients.  SPARC clients have a slightly different entry in the DHCP config file, again with some manual tweaking to create a fixed IP address for my client:

subnet 192.168.2.0 netmask 255.255.255.0 {
  range 192.168.2.200 192.168.2.201;
  option broadcast-address 192.168.2.255;
  option routers 192.168.2.1;
  next-server 192.168.2.13;
}

class "SPARC" {
  match if not (substring(option vendor-class-identifier, 0, 9) = "PXEClient");
  filename "http://192.168.2.13:5555/cgi-bin/wanboot-cgi";
}

host sparcy {
   hardware ethernet 00:14:4f:fb:52:3c ;
   fixed-address 192.168.2.202 ;
}
Now, I of course want this installation to be completely hands-off.  For this to work, I'll need to modify the grub boot menu for this client slightly.  You can find it in /etc/netboot.  "installadm create-client" will create a new boot menu for every client, identified by the client's MAC address.  The template for this can be found in a subdirectory with the name of the install service, /etc/netboot/x86-fcs in our case.  If you don't want to change this manually for every client, modify that template to your liking instead.
root@ai-server:~# cd /etc/netboot
root@ai-server:~# cp menu.lst.01080027AA3DB1 menu.lst.01080027AA3DB1.org
root@ai-server:~# vi menu.lst.01080027AA3DB1
root@ai-server:~# diff menu.lst.01080027AA3DB1 menu.lst.01080027AA3DB1.org
1,2c1,2
< default=1
< timeout=10
---
> default=0
> timeout=30
root@ai-server:~# more menu.lst.01080027AA3DB1
default=1
timeout=10
min_mem64=0

title Oracle Solaris 11 11/11 Text Installer and command line
	kernel$ /x86-fcs/platform/i86pc/kernel/$ISADIR/unix -B install_media=http://$serverIP:5555//export/install/fcs,install_service=x86-fcs,install_svc_address=$serverIP:5555
	module$ /x86-fcs/platform/i86pc/$ISADIR/boot_archive

title Oracle Solaris 11 11/11 Automated Install
	kernel$ /x86-fcs/platform/i86pc/kernel/$ISADIR/unix -B install=true,install_media=http://$serverIP:5555//export/install/fcs,install_service=x86-fcs,install_svc_address=$serverIP:5555,livemode=text
	module$ /x86-fcs/platform/i86pc/$ISADIR/boot_archive

Now just boot the client off the network using PXE-boot.  For my demo purposes, that's a client from VirtualBox, of course.   Again, if this were a SPARC system, you'd instead be typing "boot net:dhcp - install" at the OK prompt and then just watch the installation.

That's all there is to it.  And despite the fact that this blog entry is a little longer - that wasn't so hard now, was it?

Wednesday Feb 29, 2012

Solaris Fingerprint Database - How it's done in Solaris 11

Many remember the Solaris Fingerprint Database.  It was a great tool to verify the integrity of a Solaris binary.  Unfortunately, it went away with the rest of SunSolve, and was not revived in the replacement, "My Oracle Support".  Here's the good news:  It's back for Solaris 11, and it's better than ever!

It is now totally integrated with IPS...  Read more


Monday Feb 20, 2012

Solaris 11 submitted for EAL4+ certification

Solaris 11 has been submitted for certification under the Canadian Common Criteria Scheme at level EAL4+.  It will be certified against the protection profile "Operating System Protection Profile (OS PP)" as well as the extensions

  • Advanced Management (AM)
  • Extended Identification and Authentication (EIA)
  • Labeled Security (LS)
  • Virtualization (VIRT)

EAL4+ is the highest level typically achievable for commercial software, and is the highest level mutually recognized by 26 countries, including Germany and the USA.  Completion of the certification lies in the hands of the certification authority.

You can check the current status of this certification (as well as other certified Oracle software) on the page Oracle Security Evaluations.

Friday Oct 07, 2011

Solaris 11 Launch

There have been many questions and rumors about the upcoming launch of Solaris 11.  Now it's out:  Watch the webcast on

November 9, 2011
at 10am ET

Consider yourself invited to join!

(I hope to get around to summarizing all the OpenWorld announcements, especially around T4, soon...)

Monday Aug 01, 2011

Oracle Solaris Studio 12.3 Beta Program

The beta program for Oracle Solaris Studio 12.3 is now open for participation.  Anyone willing to test the newest compiler and developer tools is welcome to join.  You can expect performance improvements over earlier versions of Studio as well as over GCC, which should make testing worth your while.

Happy testing!

Thursday Mar 31, 2011

What's up with Solaris 11?

Interested in the upcoming Solaris 11?  What will be the highlights?  What exactly is the new packaging format, how does the new installer work?  What do the analysts think?


All this will be covered in the Solaris Online Forum on April 14, starting at 9 am PST.  This will be a live event where you can ask questions.  (A recording will be available afterwards.)  Speakers are all high-level members of development and product management.


All further details can be found at the registration page.

Friday Dec 17, 2010

Solaris knows Hardware - pgstat explains it

When Sun's engineering teams observed large differences in memory latency on the E25K, they introduced the concept of locality groups (lgrp) into Solaris 9 9/02.  Locality groups describe the hierarchy of system components, which can be very different on different hardware systems.  When creating processes and scheduling them onto CPUs for execution, Solaris will try to minimize the distance between CPU and memory for optimal latency.  This feature, known as Memory Placement Optimization (MPO), can, depending on hardware and application, significantly enhance performance.

There are, among many other things, thousands of counters in the Solaris kernel.  They can be queried using kstat, cpustat, or more widely used tools like mpstat or iostat.  The counters made available through cpustat, especially, depend heavily on the underlying hardware.  Still, it hasn't always been easy to analyze the performance benefit of MPO and the utilization of individual parts of the hardware using these counters.  For cpustat, there was only a Perl script called corestat to help understand T1/T2 core utilization.  This has finally changed with Solaris 11 Express.


There are now three new commands: lgrpinfo, pginfo and pgstat.

lgrpinfo shows the hierarchy of the lgroups - the NUMA architecture of the hardware.  This can be useful when configuring resource groups (for Containers or standalone) to select the right CPUs.
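Seeing it is as simple as running the command; what it prints obviously depends on your hardware:

# lgrpinfo
# lgrpinfo -a

The plain invocation shows the lgroup hierarchy with its CPUs and memory, while -a adds all the details lgrpinfo knows about.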

pginfo shows a different view of this information: a tree of the hardware hierarchy.  The leaves of this tree are the individual integer and floating-point units of each core.  Here's a little example from a T2 LDom configured with 16 strands from different cores:


# pginfo -v
0 (System [system]) CPUs: 0-15
|-- 3 (Data_Pipe_to_memory [chip]) CPUs: 0-7
|   |-- 2 (Floating_Point_Unit [core]) CPUs: 0-3
|   |   `-- 1 (Integer_Pipeline [core]) CPUs: 0-3
|   `-- 5 (Floating_Point_Unit [core]) CPUs: 4-7
|       `-- 4 (Integer_Pipeline [core]) CPUs: 4-7
`-- 8 (Data_Pipe_to_memory [core,chip]) CPUs: 8-15
    `-- 7 (Floating_Point_Unit [core,chip]) CPUs: 8-15
        |-- 6 (Integer_Pipeline) CPUs: 8-11
        `-- 9 (Integer_Pipeline) CPUs: 12-15

As you can see, the mapping of strands to pipelines and cores is easily visible.

pgstat, finally, is a worthy successor to corestat.  It gives you a good overview of the utilization of all these components.  Again an example from the same LDom, which happens to show almost 100% core utilization at the same time - something I don't see very often...


# pgstat -Apv 1 2
 PG  RELATIONSHIP                      HW     UTIL  CAP   SW      USR     SYS   IDLE  CPUS
  0  System [system]                   -      -     -     100.0%  99.6%   0.4%  0.0%  0-15
  3  Data_Pipe_to_memory [chip]        -      -     -     100.0%  99.1%   0.9%  0.0%  0-7
  2  Floating_Point_Unit [core]        0.0%   179K  1.3B  100.0%  99.1%   0.9%  0.0%  0-3
  1  Integer_Pipeline [core]           80.0%  1.3B  1.7B  100.0%  99.1%   0.9%  0.0%  0-3
  5  Floating_Point_Unit [core]        0.0%   50K   1.3B  100.0%  99.1%   0.9%  0.0%  4-7
  4  Integer_Pipeline [core]           80.2%  1.3B  1.7B  100.0%  99.1%   0.9%  0.0%  4-7
  8  Data_Pipe_to_memory [core,chip]   -      -     -     100.0%  100.0%  0.0%  0.0%  8-15
  7  Floating_Point_Unit [core,chip]   0.0%   80K   1.3B  100.0%  100.0%  0.0%  0.0%  8-15
  6  Integer_Pipeline                  76.4%  1.3B  1.7B  100.0%  100.0%  0.0%  0.0%  8-11
  9  Integer_Pipeline                  76.4%  1.3B  1.7B  100.0%  100.0%  0.0%  0.0%  12-15
 PG  RELATIONSHIP                      HW     UTIL  CAP   SW      USR     SYS   IDLE  CPUS
  0  System [system]                   -      -     -     100.0%  99.7%   0.3%  0.0%  0-15
  3  Data_Pipe_to_memory [chip]        -      -     -     100.0%  99.5%   0.5%  0.0%  0-7
  2  Floating_Point_Unit [core]        0.0%   76K   1.2B  100.0%  99.5%   0.5%  0.0%  0-3
  1  Integer_Pipeline [core]           79.7%  1.2B  1.5B  100.0%  99.5%   0.5%  0.0%  0-3
  5  Floating_Point_Unit [core]        0.0%   42K   1.2B  100.0%  99.5%   0.5%  0.0%  4-7
  4  Integer_Pipeline [core]           79.8%  1.2B  1.5B  100.0%  99.5%   0.5%  0.0%  4-7
  8  Data_Pipe_to_memory [core,chip]   -      -     -     100.0%  99.9%   0.1%  0.0%  8-15
  7  Floating_Point_Unit [core,chip]   0.0%   80K   1.2B  100.0%  99.9%   0.1%  0.0%  8-15
  6  Integer_Pipeline                  76.3%  1.2B  1.5B  100.0%  100.0%  0.0%  0.0%  8-11
  9  Integer_Pipeline                  76.4%  1.2B  1.5B  100.0%  99.8%   0.2%  0.0%  12-15

SUMMARY: UTILIZATION OVER 2 SECONDS

                                       ------HARDWARE------    ------SOFTWARE------
 PG  RELATIONSHIP                      UTIL  CAP   MIN    AVG    MAX     MIN     AVG     MAX     CPUS
  0  System [system]                   -     -     -      -      -       100.0%  100.0%  100.0%  0-15
  3  Data_Pipe_to_memory [chip]        -     -     -      -      -       100.0%  100.0%  100.0%  0-7
  2  Floating_Point_Unit [core]        76K   1.2B  0.0%   0.0%   0.0%    100.0%  100.0%  100.0%  0-3
  1  Integer_Pipeline [core]           1.2B  1.5B  79.7%  79.7%  80.0%   100.0%  100.0%  100.0%  0-3
  5  Floating_Point_Unit [core]        42K   1.2B  0.0%   0.0%   0.0%    100.0%  100.0%  100.0%  4-7
  4  Integer_Pipeline [core]           1.2B  1.5B  79.8%  79.8%  80.2%   100.0%  100.0%  100.0%  4-7
  8  Data_Pipe_to_memory [core,chip]   -     -     -      -      -       100.0%  100.0%  100.0%  8-15
  7  Floating_Point_Unit [core,chip]   80K   1.2B  0.0%   0.0%   0.0%    100.0%  100.0%  100.0%  8-15
  6  Integer_Pipeline                  1.2B  1.5B  76.3%  76.3%  76.4%   100.0%  100.0%  100.0%  8-11
  9  Integer_Pipeline                  1.2B  1.5B  76.4%  76.4%  76.4%   100.0%  100.0%  100.0%  12-15

The exact meaning of these values is nicely described in the manpage for pgstat, so I'll leave the interpretation to the reader. With this little tool, performance analysis, especially on T2/T3 systems, will be even more fun ;-)

Tuesday Sep 14, 2010

How auto_reg really works in Solaris 10 09/10

The newest update of Solaris 10 (09/10) brings a new feature called auto registration.  It can be automated using the new "auto_reg" keyword in the sysidcfg file.  Or rather, it can be sometimes.  Due to an (already known) bug, this new parameter is ignored by the GUI installer, which will query you for the registration details no matter what you put into the sysidcfg file.  The GUI installer usually runs if you have a screen (and video card) attached to the system.  On headless servers, the text installer runs, which correctly acts upon the "auto_reg" settings.
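For reference, the relevant sysidcfg line is a one-liner; "disable" switches auto registration off entirely, and the sysidcfg(4) man page lists the other accepted values:

auto_reg=disable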


As a workaround for workstations, use "boot net - install nowin" instead of the usual "boot net - install", and you're all set.


 Many thanks to Peter Tribble, who suffered through this and eventually found the solution for me.

Monday Jun 07, 2010

prstat and microstate accounting

You never stop learning.  As a reply to my last blog entry, it was pointed out to me that since Solaris 10, microstate accounting is always enabled, and that prstat supports it with the option "-m".  This option removes the lag of the moving averages from the values displayed and is much more accurate.  I wanted to know more about the background.  Eric Schrock was kind enough to provide it on his blog.  Here's a short summary.


The legacy output of prstat (and some of the other monitoring commands) represents moving averages based on regular samples.  With higher CPU frequencies, it becomes more and more likely that some scheduling events will be missed completely by these samples, which makes the reports increasingly unreliable.  Microstate accounting collects statistics for every event, when the event happens.  Thanks to some implementation tricks introduced with Solaris 10, this is now efficient enough to be turned on all the time.  If you use this more precise data with prstat, a CPU hog will show up immediately, showing 100% CPU on all threads involved.  This way, you're much more precise, and you needn't convert from the number of CPUs in the system to the corresponding percentage as in the example in my blog entry.  A single-threaded process will be visible instantly.  This is easier to do, easier to understand, less error-prone and more exact.
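In practice this boils down to a one-liner - per-thread microstate columns, sorted by CPU usage and refreshed every 5 seconds:

# prstat -mL -s cpu 5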


I've also updated the presentation to represent this.


Thanks for the hint - you know who you are!  It's from things like this that I notice I've been using prstat and the likes (successfully) for too long.  It's just like Eric mentioned in his blog: this great feature slipped past me, alongside all the more prominent stuff like Containers, ZFS, SMF etc.  Thanks again!

Thursday Mar 11, 2010

Demo for Solaris Resource Manager

Here's an example of how to build a live resource-management demo like the one mentioned in the scalability entry below:



  1. Create two projects by adding the following lines to /etc/project:

     srm-demo:100::demo::project.cpu-shares=(privileged,3,none)
     srm-loader:101::demo::project.cpu-shares=(privileged,1,none)

  2. Start your interactive demo:

     newtask -p srm-demo java -jar /usr/java/demo/jfc/Java2D/Java2Demo.jar &

  3. Now start as many background processes as you need:

     newtask -p srm-loader someload.sh &

     where someload.sh could be:

     #!/bin/sh
     while true
     do
       echo adflkjasdflkjasdflkjasdflkj > /dev/null
     done

  4. Now move your processes to the FSS class (see the prstat sketch below to watch the effect):

     priocntl -s -c FSS -i projid 101
     priocntl -s -c FSS -i projid 100

     or back to the TS class to show the effect of resource management:

     priocntl -s -c TS -i projid 100
     priocntl -s -c TS -i projid 101
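To watch the effect while switching between TS and FSS, a per-project view of CPU consumption is handy - prstat -J aggregates usage by project, so srm-demo and srm-loader show up side by side:

# prstat -J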


Wednesday Mar 10, 2010

Some thoughts about scalability


In a recent email thread about scalability and why Solaris is especially good at it, some long-time performance gurus summarized the subject matter so well that I thought it worth sharing with a broader community.  They agreed, so here it is:

What is scalability, and why is Solaris so good at not preventing applications from scaling?

Good scalability is a classic observation about systems that have been profiled for many years: they not only perform well at high load, they also degrade less on overload.


The cause is usually described mathematically: the slope of the response time (degradation) curve is dominated by the service time of the single slowest component.  A new product usually has a few large bottlenecks, and because they're large, the response time curve takes off for infinity early and goes almost straight up.  Overloading the system even a little bit causes it to "hit the wall" and seem to hang.  If X is load and Y is response time, the curve looks like this:

That's called the "hockey-stick" curve in the trade ;-) Response time is fairly flat until an inflection point, then heads up like a homesick angel.
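A rough way to make that quantitative (a textbook single-queue approximation, not something from the original thread): if the slowest component has service time S and X is the offered load, the average response time behaves roughly like

    R(X) = S / (1 - X*S)

which stays close to S at low load but blows up as X approaches 1/S - so the bigger the bottleneck S, the earlier and steeper the hockey stick.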

A well-profiled, mature product has lots of little bottlenecks, one of which is the largest and therefore sets the inflection point and the slope.  With a small bottleneck, the slope is gentle, and during an overload, users see the system as somewhat slow, not hung.  This looks a little like this:


The reason you get bad performance at high loads on unprofiled programs is that above 80% load, there is a good chance that multiple users will make requests at the same time and momentarily drive the system past its inflection point into degradation.  As the system is not optimized, the degradation at that point is large and user-visible.  This usually hits at around 70 or 80%, sometimes even less.

We've been hunting down and fixing the slow bits for a long time, and have a very very gentle degradation curve.  PCs, on the other hand, tend to hit the wall really easily, and often.  Some of their legendary unreliability is really bogus: users overload their machines, assume they've hung, and then reboot.

In particular, the fine-grained spin-locking in Solaris is often celebrated as being responsible for a lot of its superior scaling.  In contrast, coarse-grained locks inflate the response time of inherently serial sections, with the resulting impact just as Amdahl's Law would dictate.  A large set of evolved, architecture-aware features makes the Solaris scheduler itself a huge factor in the superior scaling of Solaris.  Other features such as evolved AIO options and preemption control, which have been well integrated by Oracle, provide even more reasons for better scaling.

I should add that superior scaling is not all about peak throughput and the average response-time curve as a function of load; it also tends to manifest as reduced variance in response times in many cases - as well as the "graceful degradation" on the far side of peak throughput that you mentioned.  Those are factors I'd like to see characterized more frequently - but the habit in the benchmarking world is often to simply celebrate the peak results.

A last thing I'd mention is that the foundation for this was laid when Sun's version of SVR4 was defined.  We pretty much threw out AT&T's implementation and did our own, with the idea of full preemption, multi-threading and all the rest.  It's much, much easier to deliver things on top of a solidly built foundation; if you also need to rebuild the foundation, it's far harder to make things work.  One could make a pretty solid argument that, without the foundation, the rest would have been much, much harder.

Big hardware on top of a great foundation leads to customers who throw more work at the boxes, which exposes problems that we fix, which leads to customers throwing even more work at the boxes...

To see all this in action, here's what you need for a live demo:

Start the old Gnome perfmeter (or perfbar or mpstat) on a customer system running an interactive load.  You can use the Java2D demo that comes with any JDK.  Then fire up dummy CPU loads and push the CPU utilization higher and higher in front of the customer's eyes, until the interactive demo finally starts to feel slow.  They'll be amazed at how close to 100% they are before they see any actual, user-visible degradation.

And if you want to make their brain explode, use SRM to grant their app 80% of the cpu and then start dozens of dummy CPU loads in another zone to force the CPU to pin at 100%, while their performance stays fine. Of course, this is incredible enough that you may just convince them that you're faking it ;-)


We did this in a demo to techies during one immersion week, and even though they knew what we were doing, there were a lot of jaws left on the demo-room floor when they saw the theory in practice.

This article has been compiled from several email messages by

David Collier-Brown
James Litchfield
Bob Sneed

Thank you!
(A German version of this article is available in the German part of this blog.)

Monday Mar 08, 2010

Solaris lives that long...

No, this time it's not about how long a particular version of Solaris is on the market.  I covered that already (although in German only) ;-)  This time it's about how long a server can remain up and alive and deliver its service.  But see for yourself:



I saw this at a customer who fears only two things: a call from support requesting the latest kernel patch, and the upcoming consolidation project.  The latter is the more likely event that will stop this server short of 2500 days of uptime.

About

News, tips and other interesting bits about SPARC, CMT, performance and its analysis, as well as experiences with Solaris on servers and laptops.

This is a bilingual blog (most of the time).
The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
