Monday Aug 08, 2011

Solaris 11 Express Network Tunables

Overview

For years I, and many others, have been tuning TCP, UDP, IP, and other aspects of the Solaris network stack with ndd(1M). The ndd command itself is documented, but most of the tunables it manipulates are private implementation interfaces, subject to change and in many cases undocumented. Also, ndd does not show default values, nor the possible values or ranges.

That is changing with Solaris 11 Express. A new command ipadm(1M) allows persistent and temporary (with the -t option) setting of key tunable values. This is a major improvement over ndd, where it is customary to create an /etc/rc2.d/S69ndd or similar script to set the parameter on every reboot. Another benefit is that ipadm shows the default value and the values that the property can be set to.
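
For comparison, here is the sort of boot-time script that ipadm now makes unnecessary. This is a minimal sketch only; the parameter and value are chosen purely for illustration:

#!/sbin/sh
# /etc/rc2.d/S69ndd (example): reapply ndd tunings at boot,
# since ndd settings do not survive a reboot
/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q 256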

The ipadm command has many features for configuring the IP settings of interfaces. This blog entry focuses on how ipadm replaces ndd. Note that ipadm only supports the IP, TCP, UDP, SCTP, and ICMP protocols. Other protocols such as ipsecah and keysock still require the use of ndd.

Review of ndd

To get a list of all tunables for a specific protocol, an ndd -get operation is performed with "?" as the argument. For example, this is a way of listing all the TCP parameters.
root@Solaris11Express# ndd -get /dev/tcp \?
tcp_time_wait_interval         (read and write)
tcp_conn_req_max_q             (read and write)
tcp_conn_req_max_q0            (read and write)
tcp_conn_req_min               (read and write)
...
tcp_dev_flow_ctl               (read and write)
tcp_reass_timeout              (read and write)
tcp_extra_priv_ports_add       (write only)
tcp_extra_priv_ports_del       (write only)
tcp_extra_priv_ports           (read only)
tcp_1948_phrase                (write only)
tcp_listener_limit_conf        (read only)
tcp_listener_limit_conf_add    (write only)
tcp_listener_limit_conf_del    (write only)
To get the current value of a specific parameter, list the parameter as the argument for the driver, in this case /dev/tcp.
root@Solaris11Express# ndd -get /dev/tcp tcp_conn_req_max_q
128
And to set a parameter, follow it with a value.
root@Solaris11Express# ndd -set /dev/tcp tcp_conn_req_max_q 256
root@Solaris11Express# ndd -get /dev/tcp tcp_conn_req_max_q
256
And for my own benefit, I set it back to the original.
root@Solaris11Express# ndd -set /dev/tcp tcp_conn_req_max_q 128
root@Solaris11Express# ndd -get /dev/tcp tcp_conn_req_max_q
128

Using the ipadm *-prop Options

The ipadm(1M) manual page lists three sub-commands to manage TCP/IP protocol properties.
     ipadm set-prop [-t] -p prop=value[,...] protocol
     ipadm reset-prop [-t] -p prop protocol
     ipadm show-prop [[-c] -o field[,...]] [-p prop[,...]] [protocol]
To list all the properties for all the currently supported protocols, I run ipadm with the show-prop sub-command.
root@Solaris11Express# ipadm show-prop
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
ipv4  forwarding            rw   off          --           off          on,off
ipv4  ttl                   rw   255          --           255          1-255
ipv6  forwarding            rw   off          --           off          on,off
ipv6  hoplimit              rw   255          --           255          1-255
ipv6  hostmodel             rw   weak         --           weak         strong,
                                                                        src-priority,
                                                                        weak
ipv4  hostmodel             rw   weak         --           weak         strong,
                                                                        src-priority,
                                                                        weak
icmp  recv_maxbuf           rw   8192         --           8192         4096-65536
icmp  send_maxbuf           rw   8192         --           8192         4096-65536
tcp   ecn                   rw   passive      --           passive      never,passive,
                                                                        active
tcp   extra_priv_ports      rw   2049,4045    --           2049,4045    1-65535
tcp   largest_anon_port     rw   65535        --           65535        1024-65535
tcp   recv_maxbuf           rw   128000       --           128000       2048-1073741824
tcp   sack                  rw   active       --           active       never,passive,
                                                                        active
tcp   send_maxbuf           rw   49152        --           49152        4096-1073741824
tcp   smallest_anon_port    rw   32768        --           32768        1024-65535
tcp   smallest_nonpriv_port rw   1024         --           1024         1024-32768
udp   extra_priv_ports      rw   2049,4045    --           2049,4045    1-65535
udp   largest_anon_port     rw   65535        --           65535        1024-65535
udp   recv_maxbuf           rw   57344        --           57344        128-1073741824
udp   send_maxbuf           rw   57344        --           57344        1024-1073741824
udp   smallest_anon_port    rw   32768        --           32768        1024-65535
udp   smallest_nonpriv_port rw   1024         --           1024         1024-32768
sctp  extra_priv_ports      rw   2049,4045    --           2049,4045    1-65535
sctp  largest_anon_port     rw   65535        --           65535        1024-65535
sctp  recv_maxbuf           rw   102400       --           102400       8192-1073741824
sctp  send_maxbuf           rw   102400       --           102400       8192-1073741824
sctp  smallest_anon_port    rw   32768        --           32768        1024-65535
sctp  smallest_nonpriv_port rw   1024         --           1024         1024-32768
The first column lists the protocols. Of note is that there are separate IPv4 and IPv6 listings. Per the specification, there is no ttl property for IPv6; it appears only for IPv4. IPv6 calls it the hoplimit, which is more indicative of how the value is actually used.

Including a protocol as an argument lists only those properties.

root@Solaris11Express# ipadm show-prop tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   ecn                   rw   passive      --           passive      never,passive,
                                                                        active
tcp   extra_priv_ports      rw   2049,4045    --           2049,4045    1-65535
tcp   largest_anon_port     rw   65535        --           65535        1024-65535
tcp   recv_maxbuf           rw   128000       --           128000       2048-1073741824
tcp   sack                  rw   active       --           active       never,passive,
                                                                        active
tcp   send_maxbuf           rw   49152        --           49152        4096-1073741824
tcp   smallest_anon_port    rw   32768        --           32768        1024-65535
tcp   smallest_nonpriv_port rw   1024         --           1024         1024-32768
We see the current value, whether we can set it, its default value, and the possible values or range of values. Self documenting. I like it!

To get a specific property, the -p option specifies which one to list.

root@Solaris11Express# ipadm show-prop -p send_maxbuf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   send_maxbuf           rw   49152        --           49152        4096-1073741824
Now to set a property to a specific value, use the format property=value.
root@Solaris11Express# ipadm set-prop -p send_maxbuf=4096 tcp

root@Solaris11Express# ipadm show-prop -p send_maxbuf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   send_maxbuf           rw   4096         4096         49152        4096-1073741824
The value of 4096 in the PERSISTENT column indicates this setting will be retained even after a reboot. To set the property only until the next reboot, use the -t option to set it temporarily.
root@Solaris11Express# ipadm set-prop -t -p send_maxbuf=4096 tcp

root@Solaris11Express# ipadm show-prop -p send_maxbuf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   send_maxbuf           rw   4096         --           49152        4096-1073741824
While it is certainly possible to set the value of a property back to its default by hand, I like having an option that does it for me. This is done with reset-prop. Note that the PERSISTENT column reverts to its original --.
root@Solaris11Express# ipadm reset-prop -p send_maxbuf tcp

root@Solaris11Express# ipadm show-prop -p send_maxbuf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   send_maxbuf           rw   49152        --           49152        4096-1073741824
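
For scripting, the -c (parseable output) and -o (field selection) options from the synopsis above can be combined. A small sketch; I am assuming the field names match the column headings and that -c produces colon-delimited output, as it does for other Solaris administration commands:

root@Solaris11Express# ipadm show-prop -c -o proto,property,current -p send_maxbuf tcp
tcp:send_maxbuf:49152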

What About All Those Other ndd Configuration Parameters?

The output of the show-prop operation above is very small compared to what those who use ndd are used to for even just one of the protocols. So what about all the other ndd parameters?

There are two options:

  • Continue to use ndd
  • Use ipadm with a converted form of the ndd parameter name

  The first is business as usual. The second involves converting the protocol's ndd parameter name into one that works with ipadm. The steps that have worked for me are as follows.

    • For any parameter, drop the /dev/ prefix and use the protocol name as the protocol argument to ipadm. So /dev/tcp becomes tcp.
    • If the parameter starts with the protocol name, replace that leading protocol name with an underscore (_). So tcp_local_dack_interval becomes _local_dack_interval.
    • If there is no leading protocol name, just prepend an underscore (_). For example, arp_probe_interval (an ip parameter) becomes _arp_probe_interval.
    • For the IP protocol, if there are separate IPv4 and IPv6 ndd parameters, use ipv4 or ipv6 as the ipadm protocol, respectively. With ndd, the lack of a 6 means IPv4.
    Examples of each are as follows.

    Dropping the leading protocol name and specifying it for the protocol argument.

    root@Solaris11Express# ndd -get /dev/tcp tcp_local_dack_interval
    50
    
    root@Solaris11Express# ipadm show-prop -p _local_dack_interval tcp
    PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
    tcp   _local_dack_interval  rw   50           --           50           10-500
    
    Getting a parameter that does not start with the protocol.
    root@Solaris11Express# ndd -get /dev/ip arp_probe_interval
    1500
    
    root@Solaris11Express# ipadm show-prop -p _arp_probe_interval ip
    PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
    ip    _arp_probe_interval   rw   1500         --           1500         10-20000
    
    Distinguishing between IPv4 and IPv6 parameters.
    root@Solaris11Express# ndd -get /dev/ip ip_strict_dst_multihoming
    0
    root@Solaris11Express# ndd -get /dev/ip ip6_strict_dst_multihoming
    0
    
    root@Solaris11Express# ipadm show-prop -p _strict_dst_multihoming ipv4
    PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
    ipv4  _strict_dst_multihoming rw 0            --           0            0-1
    root@Solaris11Express# ipadm show-prop -p _strict_dst_multihoming ipv6
    PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
    ipv6  _strict_dst_multihoming rw 0            --           0            0-1
    
    And when there is an error, all the fields have ? in them.
    root@Solaris11Express# ipadm show-prop -p _strict_dst_multihoming ip
    PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
    ipadm: warning: cannot get property '_strict_dst_multihoming' for 'ip'Unknown property
    ip    _strict_dst_multihoming ?  ?            ?            ?            ?
    
    As more properties are added to ipadm to manage them directly, it will become less necessary to use the ndd work-around.
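
    In the meantime, a small shell helper can do the mechanical conversion from the steps above. This is an untested sketch; the function name is made up, and the IPv4/IPv6 split for ip parameters (the fourth step) is not handled, so those still need ipv4 or ipv6 given explicitly:

    ndd2ipadm() {
        drv=${1#/dev/}                         # /dev/tcp -> tcp
        case $2 in
            ${drv}_*) prop=_${2#${drv}_} ;;    # tcp_local_dack_interval -> _local_dack_interval
            *)        prop=_$2 ;;              # arp_probe_interval -> _arp_probe_interval
        esac
        ipadm show-prop -p "$prop" "$drv"
    }

    ndd2ipadm /dev/tcp tcp_local_dack_interval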

    Wednesday Jun 08, 2011

    ZFS zpool and file system version numbers and features

    Often enough I have had to check the version of a ZFS pool or file system. Sometimes I am curious which release delivered a specific feature. So I imagine this could be useful for others. (Updated 21 Feb 2012 for Solaris 10 8/11 and Solaris 11.)

    One note is that ZFS versions are backward compatible, which means a kernel that supports a newer version can import a pool of an older version. The reverse is not true. So it is important to know the oldest kernel you might ever want to attach a pool to, and to make sure you don't upgrade your pool or file system past what that kernel supports. This table may help with that as well.

    Note: This table is sorted by pool version, then file system version. The releases are therefore not in chronological order, since a feature delivered in a version of Solaris 11 may also be delivered in a later Solaris 10 update.

    Each entry below lists the release it was delivered in, the zpool and zfs (file system) versions, the new features, and any comments.

    Solaris 11 11/11: zpool version 33, zfs version 5
    • Encryption
    • Label support for Trusted Extensions

    Solaris 11 Express 2010.11: zpool version 31, zfs version 5
    • deduplication
    • diff for snapshots
    • read-only pool import
    • pool import with missing log device

    Solaris 10 8/11: zpool version 29, zfs version 5
    • ZFS installation with Flash Archives (not really a ZFS feature)
    • ZFS send will include file system properties
    • ZFS diff
    • Pool import with missing log device
    • Pool import as read-only
    • Synchronous writes
    • ACL improvements
    • Improvements in pool messages

    Solaris 10 9/10: zpool version 22, zfs version 4
    • triple parity RAID-Z (raidz3)
    • logbias property
    • pool recovery
    • mirror splitting
    • device replacement enhancements
    • ZFS system process

    Solaris 10 10/09: zpool version 10, zfs version 3
    • ZFS with flash installation
    • user and group quotas
    • ZFS cache devices (L2ARC)
    • set ZFS properties at file system creation
    • primarycache and secondarycache properties
    • log device recovery

    Solaris 10 5/09: zpool version 10, zfs version 3
    • zone clone creates ZFS clone

    Solaris 10 10/08: zpool version 10, zfs version 3
    • separate ZIL log devices
    • ZFS boot/root file system
    • zone on ZFS
    • recursive snapshot renaming
    • snapshot rollback improvements
    • snapshot send improvements
    • gzip compression
    • multiple user data copies
    • quotas and reservations can exclude snapshots/clones
    • failure mode options
    • ZFS upgrade option
    • delegated administration
    In Solaris 10 10/08 and later, zpool and zfs have the version option. It shows the version of the pool or file system, even if it is an older ZFS pool.

    Solaris 10 5/08: zpool version 4, zfs version 1
    Pool version determined using zdb(1M) on Solaris 10 5/08

    Solaris 10 8/07: zpool version 4, zfs version 1
    • iSCSI support
    • zpool history
    • ability to set properties when creating file system
    Pool version determined using zdb(1M) on Solaris 10 8/07

    Solaris 10 11/06: zpool version 3, zfs version 1
    • recursive snapshots
    • double parity RAID-Z (raidz2)
    • clone promotion
    Pool version determined using zdb(1M) on Solaris 10 11/06

    Solaris 10 6/06: zpool version 2, zfs version 1
    • pool upgrade
    • restore of destroyed pool
    • integration into Solaris FMA
    • file system monitoring (fsstat)
    Initial release of ZFS in Solaris 10
    Pool version determined using zdb(1M) on Solaris 10 6/06
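
    On releases where zpool and zfs have the version option (Solaris 10 10/08 and later, per the comment above), the current and supported versions can be checked directly. A quick sketch; the pool and dataset names are placeholders:

    # zpool get version mypool          (version of a specific pool)
    # zfs get version mypool/mydata     (version of a specific file system)
    # zpool upgrade -v                  (all pool versions this release supports)
    # zfs upgrade -v                    (all file system versions this release supports)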

    The details of all the ZFS features introduced in the Solaris 10 updates are listed in Chapter 1 of the ZFS Administration Guide and for Solaris 11 Express in its ZFS Administration Guide.

    Hope this helps!

    Steffen

    Monday Apr 04, 2011

    Why Are Packets Going Out The Wrong Interface--Preserving For Historical Reasons

    I had previously referenced James Carlson's blog. Because the information is useful, and James is no longer with the company to update or preserve it, I am copying his posting here. Thanks again, James, for all the information regarding networking, and specifically Solaris networking, over the years!!

    Steffen

    Dated: Thursday Apr 30, 2009

    The Problem

    A common complaint for Solaris users runs something like this:

    • I have a Solaris system with two Ethernet interfaces connected to different subnets. Sometimes, I see an IP packet come in on one interface, but the packet goes back out a different one.
    • This behavior is bad for my network, because I have firewalls that check the packet sources, and they drop these misdirected packets. Why does Solaris do this? And how can I fix it? I've tried disabling routing, but that doesn't seem to help.
    Problems like this, when reported, are usually closed out as "will not fix"; see for example CR 4085133.

    The Why

    The underlying problem here is at least partly a misunderstanding of how TCP/IP works. When a system transmits a packet, it must locate the "best" interface over which to send it. By default, the algorithm for doing that is as described in RFC 1122 section 3.3.1. Note in particular section 3.3.1.1. This requires the system to look at local interfaces first -- all of them -- to try to match the destination address. And once we find the interface by the destination address, we're done.

    That alone is enough to make things not work as expected. If you send a packet to the local address on ce0 from some other system, but that other system is best reachable through bge0, then we'll send the reply via bge0. It doesn't go back out through ce0, even if the original request came in that way.

    When considering a non-interface route (whether only the "default routes" of RFC 1122 or the more flexible CIDR routes of RFC 1812), the system will look up the route by destination IP address alone, and then use the route to obtain the output interface. This often causes the same sort of confusion when a "default route" ends up causing packets to go to the default router that the administrator thinks don't belong there.

    I actually consider this a design feature of TCP/IP, and not a flaw. It's part of the robustness that IP's datagram routing system offers: every node in the network -- hosts and routers alike -- independently determines the best way to send each distinct datagram based solely on the destination IP address. This allows for "healing" of broken networks, as the failure of one interface or router means that you can potentially still use a different (perhaps less preferred) one to send your message.

    There are some related bits of confusion in this area. For example, some programmers think that binding to a particular IP address means that the interface with that address is "bound" and all packets will go out that way. That's not correct. The system still uses the destination address to pick the output path for each individual IP packet, even if your socket is bound to an address on some particular interface. And, as long as you don't set the ip_strict_dst_multihoming ndd flag (it's not set by default), binding to an address doesn't mean that packets will only arrive on that corresponding interface. They can arrive on any interface in the system, as long as the IP address matches the one bound.

    The Solutions

    There are many ways to fix this issue, and the right answer for a given situation likely depends on the details of that situation.

    • The main issue here is the kernel's forwarding table, so putting the right things into the forwarding table is one of the first tasks.

      A common problem is that the administrator has set up a "default router," but that specified router cannot correctly forward to all possible IP destinations. Some packets the system sends end up getting misdirected or lost as a result. The solution is not having that router as a "default router," and instead using more specific routes (perhaps running a listen-only routing protocol to simplify the administrative burden).

    • Some systems have a "route by source address" feature. Solaris isn't one of those, though there is an RFE open on it (see CR 4777670). A better answer, in my opinion, would be to do something similar to what's suggested in CR 4173841. That would be, when we have multiple matching routes, to prefer a route that gives us an output interface in the same subnet as the source address.

      It's a simple tweak, and would at least fix the folks who have problems with default route selection. It would not fix the problems people with interfaces on separate subnets have, though.

    • Applications that care about interface selection can use IP_BOUND_IF or IP_PKTINFO to select the specific interface desired.

      See the ip(7P) man page on your system for details.

    • If all else fails, you can use IP Filter's fastroute/to keyword on an output interface to put packets right where you want them. You should be aware that when you do this, you're circumventing IP's routing features, which means that if there's an interface or path failure, you may cause connections to fail that didn't need to fail.

    Tuesday Nov 23, 2010

    Getting GDM to work on text Solaris 11 Express 2010.11 installs

    One of the features of Solaris 11 Express is to install into a ZFS pool, which allows updates to be easily managed using ZFS snapshots and clones. The LiveCD install, however, does not offer the option to save space for another ZFS pool. I prefer to have a separate pool for data, even on my single-disk laptop. The only way to do that, as far as I can tell, is to install using the text installer. One side effect of the text installer is that it does not install everything necessary to run a GUI desktop, which is very handy on a laptop.

    Thanks to some replies to an internal question I posted, there is a relatively easy way to add the necessary packages to allow GDM and related tools to work. I have used these steps several times, and this write-up describes them.

    The initial text based install put 494 packages on the system.

    Solaris 11 Express 2010.11# pkg list | wc -l
    495
    Solaris 11 Express 2010.11# pkg list | head
    NAME (PUBLISHER)                              VERSION         STATE      UFOXI
    SUNWcs                                        0.5.11-0.151.0.1 installed  -----
    SUNWcsd                                       0.5.11-0.151.0.1 installed  -----
    archiver/gnu-tar                              1.23-0.151.0.1  installed  -----
    compress/bzip2                                1.0.6-0.151.0.1 installed  -----
    compress/gzip                                 1.3.5-0.151.0.1 installed  -----
    compress/p7zip                                4.55-0.151.0.1  installed  -----
    compress/unzip                                5.53.7-0.151.0.1 installed  -----
    compress/zip                                  2.32-0.151.0.1  installed  -----
    consolidation/SunVTS/SunVTS-incorporation     0.5.11-0.151.0.1 installed  -----
    
    To add the required packages to the system, the slim_install package has to be added. This adds an additional 390 packages to the system.
    Solaris 11 Express 2010.11# pkg install slim_install
                   Packages to install:   390
               Create boot environment:    No
                   Services to restart:    10
    DOWNLOAD                                  PKGS       FILES    XFER (MB)
    Completed                              390/390 42204/42204  410.5/410.5
    
    PHASE                                        ACTIONS
    Install Phase                            67952/67952
    
    PHASE                                          ITEMS
    Package State Update Phase                   390/390
    Image State Update Phase                         2/2
    
    After this, I did a reboot, just to make sure. Then I uninstalled the slim_install package, which removed only that one. The other 389 packages must have been dependencies of slim_install.
    Solaris 11 Express 2010.11# pkg uninstall slim_install
                    Packages to remove:     1
               Create boot environment:    No
    PHASE                                        ACTIONS
    Removal Phase                                828/828
    
    PHASE                                          ITEMS
    Package State Update Phase                       1/1
    Package Cache Update Phase                       1/1
    Image State Update Phase                         2/2
    
    Once I enable GDM, the screen shows activity and shortly I have the familiar GUI login prompt.
    Solaris 11 Express 2010.11# svcs gdm
    STATE          STIME    FMRI
    disabled       12:26:40 svc:/application/graphical-login/gdm:default
    
    Solaris 11 Express 2010.11# svcadm enable gdm
    
    Solaris 11 Express 2010.11# svcs gdm
    STATE          STIME    FMRI
    online         12:38:11 svc:/application/graphical-login/gdm:default
    
    I hope this helps others. I certainly know where to look when I have to do this again!

    Steffen

    [Updated 2010.11.23]

    First, I'd like to acknowledge Keith Mitchell who provided me with the suggestion to do the install and uninstall of the slim_install package.

    Second, in the process of checking in with Keith, he suggested taking care when doing the above operations while logged in on the console. If you leave yourself logged in at the console when GDM starts, there is a small chance that certain devices will not be configured properly when logging into GNOME, due to how logindevperm works. Suggestions include:

    svcadm enable gdm && exit
    
    or
    svcadm enable gdm; exit
    
    I did this remotely, at least the most recent time, to capture the output for this blog. I did not notice any effects when I had done this the first time on a different system; however, I might have rebooted at that point anyway.

    Thanks again to Keith for his tips!

    Monday Nov 15, 2010

    Solaris 11 Express 2010.11 is available!!

    Congratulations to everyone for getting Solaris 11 Express out! With this release come a lot of networking improvements and features, including the following:
    • Network Virtualization and Resource Management (Crossbow) with VNICs and flows
    • New IP administrative interface (ipadm)
    • IPMP rearchitecture
    • IP observability and data link statistics (dlstat)
    • Link protection when a data link is given to a guest VM
    • Layer 2 bridging
    • More device types supported by dladm, including WiFi
    • Network Automagic for automatic network selection on desktops
    • Improved socket interface for better performance
    Get more information here.

    More to follow in the future!

    Steffen

    Friday Oct 15, 2010

    New privilege added to the 'basic' Least Privilege set

    Oracle Solaris 10 9/10 (update 9) has added another privilege to the basic set of privileges, the set that all unprivileged (non-root) users have by default.

    With Least Privileges, a non-root process by default has the ability to get process information, create and delete files, fork and exec, and now separately open TCP or UDP end points. The ppriv(1) command prints the list of privileges.

    Solaris 10 9/10# ppriv -l basic
    file_link_any
    proc_exec
    proc_fork
    proc_info
    proc_session
    net_access
    
    A verbose listing includes basic descriptions, which are also described in privileges(5).

    Solaris 10 9/10# ppriv -lv basic
    file_link_any
           Allows a process to create hardlinks to files owned by a uid
           different from the process' effective uid.
    proc_exec
           Allows a process to call execve().
    proc_fork
           Allows a process to call fork1()/forkall()/vfork()
    proc_info
           Allows a process to examine the status of processes other
           than those it can send signals to.  Processes which cannot
           be examined cannot be seen in /proc and appear not to exist.
    proc_session
           Allows a process to send signals or trace processes outside its
           session.
    net_access
           Allows a process to open a TCP or UDP network endpoint.
    
    With the addition of the net_access privilege, it is now possible to prevent a process from creating sockets and network end points, isolating the process from the network. By default, processes have this privilege, so any action would be to remove it.

    To demonstrate this I am using the ppriv command to limit the privilege of a command and see with the debug flag what is happening.

    Even as an unprivileged user I can see if a specific IP address is in use with the ping command. So let's see what happens when I don't have the net_access privilege. I am doing this as a basic user.

    Solaris 10 9/10$ ppriv -D -s I-net_access -e /usr/sbin/ping 172.16.1.1
    ping[14942]: missing privilege "net_access" (euid = 1001, syscall = 5) 
       for "devpolicy" needed at spec_open+0xd0
    ping[14942]: missing privilege "net_access" (euid = 1001, syscall = 5) 
       for "devpolicy" needed at spec_open+0xd0
    ping[14942]: missing privilege "net_access" (euid = 1001, syscall = 5) 
       for "devpolicy" needed at spec_open+0xd0
    /usr/sbin/ping: unknown host 172.16.1.1
    
    Since I am forking a process with the -e option, I limit the I (inherited) privilege set with net_access removed. The debug output shows that it is net_access that is missing, and it happens three times.

    To see how it would look with the privilege, I run the same command with the basic set inherited.

    Solaris 10 9/10$ ppriv -D -s I=basic -e /usr/sbin/ping 172.16.1.1
    172.16.1.1 is alive 
    
    Everything worked, and no debug output.

    It's a good idea to use predefined sets such as basic, so that changes to the set don't affect scripts in the future.
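
    To make such a restriction persistent for a particular account rather than a single command, the account's default privilege set can be trimmed. A sketch, with a made-up user name:

    Solaris 10 9/10# usermod -K 'defaultpriv=basic,!net_access' appuser

    After the next login, every process started by appuser runs without net_access and therefore cannot open TCP or UDP endpoints.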

    Steffen

    Thursday Jun 17, 2010

    TCP Fusion and improved loopback traffic

    In the past, when two processes were communicating using TCP on the same system, a lot of the TCP and IP protocol processing was performed just as it was for traffic to and from another system. A significant amount of CPU is spent in the protocol layers, on both the sending and receiving side, to ensure data arrives successfully, completely, in order, without duplicates, and rerouted around network failures. For data that never leaves the system that work is unnecessary, so there is considerable performance benefit in providing a short circuit for it.

    In Solaris 10 6/06 a feature called TCP Fusion was delivered, which removes all of that stack processing when both ends of the TCP connection are on the same system and, now with IP Instances, in the same IP Instance (between the global zone and all shared IP zones, or within an exclusive zone). There are some exceptions, including the use of IPsec, IPQoS, raw sockets, kernel SSL, other non-simple TCP/IP conditions, or when the two end points are on different squeues. A fused connection will also revert to unfused if an IP Filter rule would drop a packet. In the general case, however, TCP fusion is done.

    So why do I bring this up? With TCP fusion enabled (which it is by default in Solaris 10 6/06 and later, and in OpenSolaris), when a TCP connection is created between processes on a system, the necessary things are set up to transfer data from the sender to the receiver without sending it down and back up the stack. The typical flow control of filling a send buffer (defaults to 48K or the value of tcp_xmit_hiwat, unless changed via a socket operation) still applies. With TCP Fusion on, there is a second check, which is the number of writes to the socket without a read. The reason for the counter is to allow the receiver to get CPU cycles, since the sender and receiver are on the same system and may be sharing one or more CPUs. The default value of this counter is eight (8), as determined by tcp_fusion_rcv_unread_min. The value per TCP connection is calculated as

    MAX(sndbuf >> 14, tcp_fusion_rcv_unread_min);
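
    For example, with the default 48 KB (49152 byte) send buffer, sndbuf >> 14 is 3, so the connection gets MAX(3, 8) = 8 consecutive unread writes before the sender is blocked.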
    
    Some details of the reasoning and implementation are in Change Request 4821256.

    When doing large writes, or when the receiver is actively reading, the buffer flow control dominates. However, when doing smaller writes, it is easy for the sender to end up in a condition where the limit on consecutive writes without a read is exceeded, and the writer blocks or, if using non-blocking I/O, gets an EAGAIN error.

    The latter was the case at a customer of mine. An ISV application was reporting EAGAIN errors on a new installation, something that hadn't been seen before. More importantly, the ISV was also not seeing it elsewhere or in their test environment.

    After some investigation using DTrace, including reproduction on slightly different system configuration, it became clear that the sending application was getting the error after a burst of writes. The application has both local and remote (on other systems) receivers, and the EAGAIN errors were only happening on the local connection.

    I also saw that the application was repeatedly doing a pair of writes, one of 12 bytes and the second of 696 bytes. Thus it would be easy to hit the consecutive write counter before the write buffer is ever filled.

    To test this I suggested the customer change tcp_fusion_rcv_unread_min on their running system using mdb(1), increasing the counter by a factor of four (4), just to be safe.

    # echo "tcp_fusion_rcv_unread_min/W 32" | mdb -kw
    tcp_fusion_rcv_unread_min:      0x8            =       0x20
    
    Here is how you check what the current value is.
    # echo "tcp_fusion_rcv_unread_min/D" | mdb -k
    tcp_fusion_rcv_unread_min:
    tcp_fusion_rcv_unread_min:      32
    
    After running several hours of tests, the EAGAIN error did not return.

    Since then I have suggested they set tcp_fusion_rcv_unread_min to 0, to turn the check off completely. This will allow the buffer size and total outstanding write data volume to determine whether the sender is blocked, as it is for remote connections. Since the mdb change is only good until the next reboot, I suggested the customer change the setting in /etc/system.

    * Set TCP fusion to allow unlimited outstanding writes up to the TCP send buffer set by default or the application.
    * The default value is 8.
    set ip:tcp_fusion_rcv_unread_min=0
    
    To turn TCP Fusion off altogether, something I have not tested, the variable do_tcp_fusion can be set from its default of 1 to 0.
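
    Again untested on my side, but assuming do_tcp_fusion lives in the same ip module as the tunable above, it could be changed the same two ways: on the running system with

    # echo "do_tcp_fusion/W 0" | mdb -kw

    or persistently in /etc/system:

    set ip:do_tcp_fusion=0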

    I hope this helps someone who might be trying to understand why errors, or maybe lower than expected throughput, are being seen on local connections.

    And I would like to note that in OpenSolaris only the do_tcp_fusion setting is available. With the delivery of CR 6826274, the consecutive write counting has been removed. The TCP Fusion code has also been moved into its own file.

    Thanks to Jim Eggers, Jim Fiori, Jim Mauro, Anders Parsson, and Neil Putnam for their help as I was tracking all this stuff down!

    Steffen

    PS. After publishing, I wrote this DTrace script to show what the per connection outstanding write counter tcp_fuse_rcv_unread_hiwater is set to.

    # more tcp-fuse.d
    #!/usr/sbin/dtrace -qs
    
    fbt:ip:tcp_fuse_maxpsz_set:entry
    {
            self->tcp = (tcp_t *) arg0;
    }

    fbt:ip:tcp_fuse_maxpsz_set:return
    /self->tcp > 0/
    {
            this->peer = (tcp_t *) self->tcp->tcp_loopback_peer;
            this->hiwat = this->peer->tcp_fuse_rcv_unread_hiwater;

            printf("pid: %d tcp_fuse_rcv_unread_hiwater: %d \n", pid, this->hiwat);

            self->tcp = 0;
            this->peer = 0;
            this->hiwat = 0;
    }
    

    Wednesday Feb 24, 2010

    Solaris 10 Zones and Networking -- Common Considerations

    As often happens, a customer question resulted in this write-up. The customer had to quickly consider how they deploy a large number of zones on an M8000. They would be configuring up to twelve separate links for the different networks, and double that for IPMP. I wrote up the following. Thanks to Penny Cotten, Jim Eggers, Gordon Lythgoe, Peter Memishian, and Erik Nordmark for the feedback as I was preparing this. Also, you may see some of this in future documentation.

    Definitions

    • Datalink: An interface at Layer 2 of the OSI protocol stack, which is represented in a system as a STREAMS DLPI (v2) interface. Such an interface can be plumbed under protocol stacks such as TCP/IP. In the context of Solaris 10 Zones, datalinks are physical interfaces (e.g. e1000g0, bge1), aggregations (aggr3), or VLAN-tagged interfaces (e1000g111000 (VLAN tag 111 on e1000g0), bge111001, aggr111003). A datalink may also be referred to as a physical interface, such as when referring to a Network Interface Card (NIC). The datalink is the 'physical' property configured with the zone configuration tool zonecfg(1M).
    • Non-global Zone: A non-global zone is any zone, whether native or branded, that is configured, installed, and managed using the zonecfg(1M) and zoneadm(1M) commands in Solaris 10. A branded zone may be either Solaris 8 or Solaris 9.

    Zone network configuration: shared versus exclusive IP Instances

    Since Solaris 10 8/07, zone configurations can be either in the default shared IP Instance or exclusive IP Instance configuration.

    When configured as shared, zone networking includes the following characteristics.

    • All datalink and IP, TCP, UDP, SCTP, IPsec, etc. configuration is done in the global zone.
    • All zones share the network configuration settings, including datalink, IP, TCP, UDP, etc. This includes ndd(1M) settings.
    • All IP addresses, netmasks, and routes are set by the global zone and can not be altered in a non-global zone.
    • Non-global zones can not utilize DHCP (neither client nor server). There is a work-around that may allow a zone to be a DHCP server.
    • By default a privileged user in a non-global zone can not put a datalink into promiscuous mode, and thus can not run things like snoop(1M). Changing this requires adding the priv_net_raw privilege to the zone from the global zone, and also requires identifying which interface(s) to allow promiscuous mode on via the 'match' zonecfg parameter. Warning: This allows the non-global zone to send arbitrary packets on those interfaces.
    • IPMP configuration is managed in the global zone and applies to all zones using the datalinks in the IPMP group. A non-global zone configured with one datalink from an IPMP group effectively uses all the datalinks in that group. Non-global zones can use multiple IPMP groups; the zone must be configured with only one datalink from each IPMP group.
    • Only default routes apply to the non-global zones, as determined by the IP address(es) assigned to the zone. Non-default static routes are not supported to direct traffic leaving a non-global zone.
    • Multiple zones can share a datalink.
    When configured as exclusive, zone networking includes the following characteristics.
    • All network configuration can be done within the non-global zone (and can also be done indirectly from the global zone, via zlogin(1) or by editing the files in the non-global zone's root file system).
    • IP and above configurations can not be seen directly within the global zone (e.g. running ifconfig(1M) in the global zone will not show the details of a non-global zone).
    • The non-global zone's interface(s) can be configured via DHCP, and the zone can be a DHCP server.
    • A privileged user in the non-global zone can fully manipulate IP address, netmask, routes, ndd variables, logical interfaces, ARP cache, IPsec policy and keys, IP Filter, etc.
    • A privileged user in the non-global zone can put the assigned interface(s) into promiscuous mode (e.g. can run snoop).
    • The non-global zone can have unique IPsec properties.
    • IPMP must be managed within the non-global zone.
    • A datalink can only be used by a single running zone at any one time.
    • Commands such as snoop(1M) and dladm(1M) can be used on datalinks in use by running zones.
    It is possible to mix shared and exclusive IP zones on a system. All shared zones will be sharing the configuration and run time data (routes, ARP, IPsec) of the global zone. Each exclusive zone will have its own configuration and run time data, which can not be shared with the global zone or any other exclusive zones.
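
    For reference, the two types differ only in the ip-type and net settings given to zonecfg(1M). A sketch with made-up zone, interface, and address names; only the network-related subcommands are shown:

    global# zonecfg -z shared-zone
    zonecfg:shared-zone> set ip-type=shared
    zonecfg:shared-zone> add net
    zonecfg:shared-zone:net> set physical=e1000g0
    zonecfg:shared-zone:net> set address=172.16.1.10/24
    zonecfg:shared-zone:net> end
    zonecfg:shared-zone> exit

    global# zonecfg -z excl-zone
    zonecfg:excl-zone> set ip-type=exclusive
    zonecfg:excl-zone> add net
    zonecfg:excl-zone:net> set physical=e1000g1
    zonecfg:excl-zone:net> end
    zonecfg:excl-zone> exit

    Setting ip-type=shared is the default and is shown only for clarity; in the exclusive case no address is specified, since addressing is configured inside the zone itself.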

    IP Multipathing (IPMP)

    By default, all IPMP configuration is managed in the global zone and affects all non-global zones whose network configuration includes even one datalink (the net->physical property in zonecfg(1M)) in an IPMP group. A zone configured with datalinks that are part of IPMP groups must configure each IP address on only one of the datalinks in the IPMP group. It is not necessary to configure an IP address on each datalink in the group. The global zone's IPMP infrastructure will manage the fail-over and fail-back of datalinks on behalf of all the shared IP non-global zones.

    For exclusive IP zones, the IPMP configuration for a zone must be managed from within the non-global zone, either via the configuration files or zlogin(1).

    The choice to use probe-based failure detection or link-based failure detection can be made on a per-IPMP-group basis, and does not affect whether the zone can be configured as shared or exclusive IP Instance. Care must be taken when selecting test IP addresses, since they will be configured in the global zone and thus may affect routing for either the global or the non-global zones.

    Routing and Zones

    The normal case for shared-IP zones is that they use the same datalinks and the same IP subnet prefixes as the global zone. In that case the routing in the shared-IP zones is the same as in the global zone. The global zone can use static or dynamic routing to populate its routing table, which will be used by all the shared-IP zones.

    In some cases different zones need different IP routing. The best approach to accomplish this is to make those zones be exclusive-IP zones. If this is not possible, then one can use some limited support for routing differentiation across shared-IP zones. This limited support only handles static default routes, and only works reliably when the shared-IP zones use disjoint IP subnets.

    All routing is managed by the zone that owns the IP Instance. The global zone owns the 'default' IP Instance that all shared IP zones use. An exclusive IP zone manages the routes for just that zone. Different routing policies, routing daemons, and configurations can be used in each IP Instance.

    For shared IP zones, only default static routes are supported. If multiple default routes apply to a non-global zone, care must be taken that all the default routes are able to reach all the destinations that the zone needs to reach. A round-robin policy is used when multiple default routes are available and a new route needs to be determined.

    The zonecfg(1M) 'defrouter' property can be used to define a default router for a specific shared IP zone. When a zone is started and the parameter is set, a default route on the interface configured for that zone will be created if it does not already exist. As of Solaris 10 10/09, when a zone stops, the default route is not deleted.

    Default routes on the same datalink and IP subnet are shared across non-global zones. If a non-global zone is on the same datalink and subnet as the global zone, default route(s) configured for one zone will apply for all other zones on that datalink and IP subnet.

    Inter-zone network traffic isolation

    There are several ways to restrict network traffic between non-global shared IP zones.
    • The /dev/ip ndd(1M) parameter 'ip_restrict_interzone_loopback', managed from the global zone, will force traffic out of the system on a datalink if the source and destination zones do not share a datalink. The default configuration allows inter-zone networking using internal loopback of IP datagrams, with the value of this parameter set to '0'. When the value is set to '1', traffic to an IP address in another zone in the shared IP Instance that is not on the same datalink will be put onto the external network. Whether the destination is reached will depend on the full network configuration of the system and the external network. This applies whether the source and destination IP addresses are on the same or different IP subnets. The parameter applies to all IP Instances active on the system, including exclusive IP Instance zones; in the case of exclusive IP zones, it only matters if the zone has more than one datalink configured with IP addresses. For two zones on the same system to communicate with 'ip_restrict_interzone_loopback' set to '1' (see the example after this list), the following conditions must hold:
      • There is a network path to the destination. If on the same subnet, the switch(es) must allow the connection. If on different subnets, routes must be in place for packets to pass reliably between the two zones.
      • The destination address is not on the same datalink (as this would break the datalink rules).
      • The destination is not on a datalink in an IPMP group that the sending datalink is also in.
      The 'ip_restrict_interzone_loopback' parameter is available in Solaris 10 8/07 and later.
    • A route(1M) action to prevent traffic between two IP addresses is available. Using the '-reject' flag will generate an ICMP unreachable when this route is attempted. The '-blackhole' flag will silently discard datagrams.
    • The IP Filter action 'intercept_loopback' will filter traffic between sockets on a system, including traffic between zones and loopback traffic within a zone. Using this action prevents traffic between shared IP zones. It does not force traffic out of the system using a datalink. More information is in the ipf.conf(4) or ipf(4) manual page.
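
    For example, forcing inter-zone traffic onto the wire is done from the global zone with ndd, and as with other ndd settings it does not persist across a reboot:

    global# ndd -set /dev/ip ip_restrict_interzone_loopback 1
    global# ndd -get /dev/ip ip_restrict_interzone_loopback
    1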

    Aggregations

    Solaris 10 1/06 and later support IEEE 802.3ad link aggregations using the dladm(1M) datalink administration command. Combining two or more datalinks into an aggregation effectively reduces the number of datalinks available. Thus it is important to consider the trade-offs between aggregations and IPMP when requiring either network availability or increased network bandwidth. Full traffic patterns must be understood as part of the decision making process.

    For the 'ce' NIC, Sun Trunking 1.3.1 is available for Solaris 10.

    Some considerations when making a decision between link aggregation and IPMP are the following.

    • Link aggregation requires support and configuration of aggregations on both ends of the link, i.e. both the system and the switch.
    • Most switches only support link aggregation within a switch, not spanning two or more switches.
    • Traffic between a single pair of IP addresses will typically only utilize one link in either an aggregation or IPMP group.
    • Link aggregation only provides availability between the switch ports and the system. IPMP using probe-based failure detection can redirect traffic around internal switch problems or network issues behind the switches.
    • Multiple hashing policies are available, and they can be set differently for inbound and outbound traffic.
    • IPMP probe-based failure detection requires test addresses for each datalink in the IPMP group, which are in addition to the application or data address(es).
    • IPMP link-based failure detection will cause a fail-over or fail-back based on link state only. Solaris 10 supports IPMP configured in link-based-only mode. If IPMP is configured for probe-based failure detection, a link failure will also cause a fail-over, and a link restore will cause a fail-back.
    • A physical interface can be in only one aggregation. VLANs can be configured over an aggregation.
    • A datalink can be in only one IPMP group.
    • An IPMP group can use aggregations as the underlying datalinks.
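
    A sketch of that last combination, two aggregations placed into one IPMP group using link-based failure detection; the interface names, group name, and addresses are illustrative only:

    global# dladm create-aggr -d bge1 -d bge2 1
    global# dladm create-aggr -d nxge0 -d nxge1 2
    global# ifconfig aggr1 plumb 172.16.1.11/24 group prod up
    global# ifconfig aggr2 plumb 172.16.1.12/24 group prod up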

    Note, this is for Solaris 10. OpenSolaris has differences. Maybe something for another day.

    I hope this is helpful! Steffen

    My thoughts on configuring zones with shared IP instances and the 'defrouter' parameter

    An occasional call or email I receive has questions about routing issues when using Solaris Zones in the (default) shared IP Instance configuration. Everything works well when the non-global zones are on the same IP subnet (let's say 172.16.1.0/24) as the global zone. Routing gets a little tricky when the non-global zones are on a different subnet.

    My general recommendation is to isolate. This means:

    • Separate subnets for the global zone (administration, backup) and the non-global zones (applications, data).
    • Separate data-links for the global and non-global zones.
      • The non-global zones can share a data-link
      • Non-global zones on different IP subnets use different data-links
    Using separate data-links is not always possible, and I was concerned whether sharing a data-link would actually work.

    So I did some testing, and exchanged some emails because of a comment I made regarding PSARC/2008/057 and the automatic removal of a default route when the zone is halted.

    Turns out I have been very restrictive in suggesting that the global and non-global zones not share a data-link. While I think that is a good administrative policy, to separate administrative and application traffic, it is not a requirement. It is OK to have the global zone and one or more non-global zones share the same data-link. However, if the non-global zones are to have different default routes, they must be on subnets that the global zone is not on.

    My test case running Solaris 10 10/09 has the global zone on the 129.154.53.0/24 network and the non-global zone on the 172.16.27.0/24 network.

    global# ifconfig -a
    ...
    e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.132 netmask ffffff00 broadcast 129.154.53.255
            ether 0:14:4f:ac:57:c4
    e1000g0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            zone shared1
            inet 172.16.27.27 netmask ffffff00 broadcast 172.16.27.255
    
    global# zonecfg -z shared1 info net
    net:
            address: 172.16.27.27/24
            physical: e1000g0
            defrouter: 172.16.27.16
    
    
    The routing table as seen from both are:
    global# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              129.154.53.215       UG        1        123
    default              172.16.27.16         UG        1          7 e1000g0
    129.154.53.0         129.154.53.132       U         1         50 e1000g0
    224.0.0.0            129.154.53.132       U         1          0 e1000g0
    127.0.0.1            127.0.0.1            UH        3         80 lo0
    
    shared1# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              172.16.27.16         UG        1          7 e1000g0
    172.16.27.0          172.16.27.27         U         1          3 e1000g0:1
    224.0.0.0            172.16.27.27         U         1          0 e1000g0:1
    127.0.0.1            127.0.0.1            UH        4         78 lo0:1
    
    While the global zone shows both routes, only the default applying to its subnet will be used. And for traffic leaving the non-global zone, only its default will be used.

    You may notice that the Interface for the global zone's default router is blank. That is because I have set the default route via /etc/defaultrouter. I noticed that if it is determined via the route discovery daemon, it will be listed as being on e1000g0! This does not affect the behavior, however it may be visually confusing, which is probably why I initially leaned towards saying to not share the data-link.

    There are multiple ways to determine which route might be used, including ping(1M) and traceroute(1M). I like the output of the route get command.

    global# route get 172.16.29.1
       route to: 172.16.29.1
    destination: default
           mask: default
        gateway: 129.154.53.1
      interface: e1000g0
          flags: <UP,GATEWAY,DONE,STATIC>
     recvpipe  sendpipe  ssthresh    rtt,ms rttvar,ms  hopcount      mtu     expire
           0         0         0         0         0         0      1500         0
    
    shared1# route get 172.16.28.1
       route to: 172.16.28.1
    destination: default
           mask: default
        gateway: 172.16.27.16
      interface: e1000g0:1
          flags: <UP,GATEWAY,DONE,STATIC>
     recvpipe  sendpipe  ssthresh    rtt,ms rttvar,ms  hopcount      mtu     expire
           0         0         0         0         0         0      1500         0
    
    This quickly shows which interfaces and IP addresses are being used. If there are multiple default routes, repeated invocations of this will show a rotation in the selection of the default routes.

    Thanks to Erik Nordmark and Penny Cotten for their insights on this topic!

    Steffen Weiberle

    Thursday Aug 20, 2009

    Why are packets going out of the "wrong" interface?

    I often refer to this blog by James Carlson, so to help others, and me, find it, here is Packets out of the wrong interface. Thanks James for all the help over the years!

    Steffen

    VLANs and Aggregations

    Every once in a while I see the question asking whether it is possible to use IEEE 802.1q VLANs together with IEEE 802.3ad Link Aggregation. I frequently have to check myself. So in order to better remind me, and share with others, here is a quick demonstration of how to get the two working together.

    My test system is running build 05 of the upcoming Solaris 10 10/09 (update 8). The system has four bge interfaces, and I will use numbers 1 and 2. (This should work just as well with previous updates of Solaris 10, and with Sun Trunking in Solaris 9, except for the zones parts. I am using zones just to isolate my traffic generation and easily get it to use a specific data link.)

    Starting out, things look like this.

    global# dladm show-dev
    bge0            link: up        speed: 1000  Mbps       duplex: full
    bge1            link: unknown   speed: 0     Mbps       duplex: unknown
    bge2            link: unknown   speed: 0     Mbps       duplex: unknown
    bge3            link: unknown   speed: 0     Mbps       duplex: unknown
    global# dladm show-link
    bge0            type: non-vlan  mtu: 1500       device: bge0
    bge1            type: non-vlan  mtu: 1500       device: bge1
    bge2            type: non-vlan  mtu: 1500       device: bge2
    bge3            type: non-vlan  mtu: 1500       device: bge3
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
            ether 0:3:ba:e3:42:8b
    
    I have my switch set up to aggregate ports 1 and 2, and here is how I do it with Solaris 10.
    global# dladm create-aggr -d bge1 -d bge2 1
    global# dladm show-link
    bge0            type: non-vlan  mtu: 1500       device: bge0
    bge1            type: non-vlan  mtu: 1500       device: bge1
    bge2            type: non-vlan  mtu: 1500       device: bge2
    bge3            type: non-vlan  mtu: 1500       device: bge3
    aggr1           type: non-vlan  mtu: 1500       aggregation: key 1
    
    VLAN-tagged interfaces are used by accessing the underlying data link with a name that encodes the VLAN tag: the instance number becomes (VLAN tag * 1000) + device instance. For bge1 and VLAN 111 that would be bge111001. For aggr1 it would be aggr111001.

    For this setup I am using zones zone111 and zone112, each configured as an exclusive IP Instance zone. The zone configuration looks like this.

    global# zonecfg -z zone111 info
    zonename: zone111
    zonepath: /zones/zone111
    brand: native
    autoboot: false
    bootargs:
    pool:
    limitpriv:
    scheduling-class:
    ip-type: exclusive
    inherit-pkg-dir:
            dir: /lib
    inherit-pkg-dir:
            dir: /platform
    inherit-pkg-dir:
            dir: /sbin
    inherit-pkg-dir:
            dir: /usr
    net:
            address not specified
            physical: aggr111001
            defrouter not specified
    
    Once configured, installed, and booted, the network configuration of zone111 is:
    global# zlogin zone111 ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    aggr111001: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
            inet 172.16.111.141 netmask ffffff00 broadcast 172.16.111.255
            ether 0:3:ba:e3:42:8c
    
    It turns out that configuring this was easy compared to showing that the link aggregation was really working. While the full list of links known to the system includes the aggregation and the VLANs on top of the aggregation, tools such as netstat or nicstat would not include them. As it turns out, they only report on interfaces that are plumbed in that IP Instance. It is not possible to plumb either bge1 or bge2, since they are members of the aggregation.
    global# dladm show-link
    bge0            type: non-vlan  mtu: 1500       device: bge0
    bge1            type: non-vlan  mtu: 1500       device: bge1
    bge2            type: non-vlan  mtu: 1500       device: bge2
    bge3            type: non-vlan  mtu: 1500       device: bge3
    aggr1           type: non-vlan  mtu: 1500       aggregation: key 1
    aggr111001      type: vlan 111  mtu: 1500       aggregation: key 1
    aggr112001      type: vlan 112  mtu: 1500       aggregation: key 1
    global# netstat -i
    Name  Mtu  Net/Dest      Address        Ipkts  Ierrs Opkts  Oerrs Collis Queue
    lo0   8232 loopback      localhost      98     0     98     0     0      0
    bge0  1500 pinebarren    pinebarren     43101  0     7181   0     0      0
    
    So I ended up using kstat(1M) to get the number of outbound packets. I am interested in outbound traffic, as that is what Solaris can affect when distributing traffic across the links in an aggregation--the switch determines that for inbound traffic.

    This example shows data on instance 2 of the bge interface for kstat value opackets.

    global# kstat -m bge -i 2 -s opackets
    module: bge                             instance: 2
    name:   mac                             class:    net
            opackets                        2542
    
    With kstat I can see that for different connections either bge1 or bge2 has packets going out on it. A good test for me was scp to a remote system. Neither ping nor traceroute caused the necessary hashing to use both links in the aggregation.
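    To sample both links at once over time, kstat also accepts multiple statistic specifiers plus an interval, so something like the following (a sketch; the 5 second interval is arbitrary) makes it easy to watch whether both bge1 and bge2 are carrying outbound traffic during a transfer.

    global# kstat -p bge:1:mac:opackets bge:2:mac:opackets 5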

    Steffen

    Wednesday Jun 17, 2009

    ssh and friends scp, sftp say "hello crypto!"

    Solaris includes the SunSSH toolset (ssh, scp, and sftp) in Solaris 9 and later. Solaris 10 comes with the Solaris Cryptographic Framework that provides an easy mechanism for applications that use PKCS #11, OpenSSL, Java Security Extensions, or the NSS interface to take advantage of cryptographic hardware or software on the system.
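    For applications that go through the OpenSSL interface, a simple way to get a feel for what the framework offers is the pkcs11 engine that ships with Solaris 10; a sketch (any of the usual openssl speed algorithms can be substituted):

    # openssl speed -engine pkcs11 rsa1024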

    Separately, the UltraSPARC® T2 processor in the T-series (CMT) has built-in cryptographic processors (one per core, or typically eight per socket) that accelerate secure one-way hashes, public key session establishment, and bulk data encryption. The latter is useful for long-standing connections and for larger data operations, such as a file transfer.

    Prior to Solaris 10 5/09, an scp or sftp file transfer operation had the encryption and decryption done by the CPU. While usually this is not a big deal, as most CPUs handle this kind of crypto reasonably fast, on the CMT systems these operations are relatively slow. Now with SunSSH With OpenSSL PKCS#11 Engine Support in 5/09, the SunSSH server and client will use the cryptographic framework when an UltraSPARC® T2 processor's n2cp cryptographic unit is available.
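    Before timing anything, a quick way to confirm that the cryptographic framework actually sees a hardware provider is cryptoadm(1M); I am omitting the output here since the provider names vary by platform.

    medford# cryptoadm list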

    To demonstrate this, I used a T5120 with Logical Domains (LDoms) 1.1 configured running Solaris 10 5/09. Using LDoms helps, as I can assign or remove crypto units on a per-LDom basis. (Since the crypto units are not supported yet with dynamic reconfiguration, a reboot of the LDom instance is required. However, in general, I don't see making that kind of change very often.)
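    For reference, this is roughly how the crypto units (MAUs) are removed from and given back to a domain with the ldm command (a sketch based on LDoms 1.1; as noted above, the domain has to be rebooted for the change to take effect):

    medford# ldm set-mau 0 primary
    medford# ldm set-mau 2 primary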

    I did all the work in the 'primary' control and service LDom, where I have direct access to the network devices, and can see the LDom configuration. I am listing parts of it here, although this is about Solaris, SunSSH, and the crypto hardware.

    medford# ldm list-bindings primary
    NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
    primary          active     -n-cv-  SP      16    8G       0.1%  22h 16m
    
    MAC
        00:14:4f:ac:57:c4
    
    HOSTID
        0x84ac57c4
    
    VCPU
        VID    PID    UTIL STRAND
        0      0      0.6%   100%
        1      1      1.9%   100%
        2      2      0.0%   100%
        3      3      0.0%   100%
        4      4      0.0%   100%
        5      5      0.1%   100%
        6      6      0.0%   100%
        7      7      0.0%   100%
        8      8      0.7%   100%
        9      9      0.1%   100%
        10     10     0.0%   100%
        11     11     0.0%   100%
        12     12     0.0%   100%
        13     13     0.0%   100%
        14     14     0.0%   100%
        15     15     0.0%   100%
    
    MAU
        ID     CPUSET
        0      (0, 1, 2, 3, 4, 5, 6, 7)
        1      (8, 9, 10, 11, 12, 13, 14, 15)
    
    MEMORY
        RA               PA               SIZE
        0x8000000        0x8000000        8G
    
    The 'system' has 16 CPUs (hardware strands), two MAUs (those are the crypto units), and 8 GB of memory. I am using e1000g0 for the network and the remote system is a V210 running Solaris Express Community Edition snv_113 SPARC (OK, I am a little behind). The network is 1 GbE.

    The command I run is

    source# /usr/bin/time scp -i /.ssh/destination /large-file destination:/tmp
    
    source# du -h /large-file
     1.3G   /large-file
    
    My results with the crypto units were
    real     1:13.6
    user       32.2
    sys        34.5
    
    while without the crypto units
    real     2:28.2
    user     2:10.9
    sys        26.8
    
    The transfer took half the wall-clock time (about 74 versus 148 seconds) and roughly a quarter of the user CPU time (about 32 versus 131 seconds) with the crypto units in place (I have two, although I think only one is being used since this is a single transfer).

    So, SunSSH benefits from the built-in cryptographic hardware in the UltraSPARC® T2 processor!

    Steffen

    Monday Jun 01, 2009

    OpenSolaris 2009.06 Delivers Crossbow (Network Virtualization and Resource Control)

    Today OpenSolaris 2009.06, the third release of OpenSolaris, is announced and available for download. Among the many features in this version is the delivery of Project Crossbow, in a fully supported distribution. This brings network virtualization, including Virtual NICs (VNICs), bandwidth control and management, flow (QoS) creation and management, virtual switches, and other features to OpenSolaris.

    Network virtualization joins a number of other features already in OpenSolaris, such as vanity naming (allowing custom names for data links), snooping on loopback for better observability, a re-architected IPMP with an administrative interface, and Network Automagic (NWAM--automatic configuration of desktop networking based on available wired and wireless network services).
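    For a first taste of the new functionality, creating a VNIC over a physical data link and capping its bandwidth might look like this (a sketch; the link name and the 100 Mbps value are just illustrative):

    # dladm create-vnic -l e1000g0 vnic1
    # dladm set-linkprop -p maxbw=100M vnic1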

    Congratulations to everyone who made all this possible!

    Steffen

    PS: Regarding "fully supported", please notice the new support prices and durations!

    Thursday Apr 16, 2009

    Sun Shared Shell - A Cool Diagnostic Tool

    [Updated 2010.10.12 with new URL]

    As part of helping a customer out recently on an escalation, the SSE on the case suggested using Sun Shared Shell, a tool that allows you to see and optionally control a remote system. It supports SSH and Telnet.

    This tool was instrumental in increasing my understanding of what was going on with the customer's system, and removed the need to wait for output via emails or just trying to understand things over the phone. The owner of the session, usually the customer, has the option of allowing you to enter commands (without hitting 'Return'), or even allowing the 'Return' as well. It also has logging and chatting capabilities.

    When first logging in, it allows you to be the owner of the shell and share that with other participants, or to view someone else's shell session.

    Once logged in, you have a terminal window, the people present on the connection, and a chat window. The icon before the name/email address shows whether you have view, type, or full control (the keyboard will also have a down-arrow with it).

    Oh, and I forgot about the feature to scribble on the screen. I used that to diagram out an idea I had to solve a zone networking issue, and it helped the others understand what I was proposing a lot quicker!

    In the spirit of 'asking for what you want instead of complaining about what you don't have', I submitted a few suggestions, and the owner(s) quickly responded with clarifications.

    I see this as a great tool to help future cases where a shared view of operations will improve understanding or service delivery! Thanks to those who created and maintain it!

    Steffen

    What happened to my packets? -- or -- Dual default routes and shared IP zones

    I recently received a call from someone who has helped me out a lot on some performance issues (thanks, Jim Fiori), and I was glad to be able to return even a small part of those favors!

    He had been contacted to help a customer who was ready to deploy a web application, and they were experiencing intermittent lack of connection to the web site. Interestingly, they were also using zones, a bunch of them (OK, a handful)--and so right up my alley.

    The customer was running a multi-tiered web application on an x4600 (so Solaris on x86 as well!), with the web server, web router, and application tiers in different zones. They were using shared IP Instances, so all the network configuration was being done in the global zone.

    Initially, we had to modify some configuration parameters, especially regarding default routes. Since the system was installed with Solaris 10 5/08 and had more recent patches, we could use the defrouter feature introduced in 10/08 to make setting up routes for the non-global zones a little easier. This was needed because the global zone was using only one NIC, and it was not going to be on the networks that the non-global zones were on.

    What made the configuration a little unique was that the web server needs a default router to the Internet, while the application server needs a route to other systems behind a different router. Individually, everything is fine. However, the web1 zone also needs to be on the network that the application and web router are on, so it ends up having two interfaces.

    Let's look at web1 when it is the only zone running.

    web1# ifconfig -a4
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 172.16.1.41 netmask ffffff00 broadcast 172.16.1.255
    bge2:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
            inet 192.168.51.41 netmask ffffff00 broadcast 192.168.51.255
    web1# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              172.16.1.1           UG        1          0 bge1
    172.16.1.0           172.16.1.41          U         1          0 bge1:1
    192.168.51.0         192.168.51.41        U         1          0 bge2:1
    224.0.0.0            172.16.1.41          U         1          0 bge1:1
    127.0.0.1            127.0.0.1            UH        5         34 lo0:1
    

    The zone is on two interfaces, bge1 and bge2, and has a default route that uses bge1. However, when zone app1 is running, there is a second default route, on bge2. The same is true if app2 or odr is running. Note that these three zones are only on bge2.

    app1# ifconfig -a4
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    bge2:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
            inet 192.168.51.43 netmask ffffff00 broadcast 192.168.51.255
    app1# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              192.168.51.1         UG        1          0 bge2
    192.168.51.0         192.168.51.43        U         1          0 bge2:1
    224.0.0.0            192.168.51.43        U         1          0 bge2:1
    127.0.0.1            127.0.0.1            UH        3         51 lo0:1
    

    In the meantime, this is what happens in web1.

    web1# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- --------- 
    default              192.168.51.1         UG        1          0 bge2
    default              172.16.1.1           UG        1          0 bge1 
    172.16.1.0           172.16.1.41          U         1          0 bge1:1
    192.168.51.0         192.168.51.41        U         1          0 bge2:4
    224.0.0.0            172.16.1.41          U         1          0 bge1:1
    127.0.0.1            127.0.0.1            UH        6        132 lo0:4
    

    With any of the other zones running, web1 now has two default routes. And it only happens in web1, as it is the only zone that is on both its public-facing data link (bge1) and a shared data link (bge2).

    Traffic to any system on either the 192.168.51.0 or 172.16.1.0 network will have no issues. But every time IP needs to determine a new path for a system not on either of those two networks, it will pick a route, round-robining between the two default routes. Thus approximately half the time connections will fail to establish, or possibly existing connections will stop working if they have been idle for a while.
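    One way to spot-check which default route would be selected for a particular off-link destination is route(1M) (a sketch; the destination address is made up, and given the round-robin behavior described above the answer can differ from one new destination to the next):

    web1# route -n get 10.5.5.5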

    This is how IP is supposed to work, so there is technically nothing wrong. It is a feature of zones and a shared IP Instance. [2009.06.23: For background on why IP works this way, see James' blog].

    The only problem is that this is not what the customer wants!

    One option would be to force all traffic between the web and application tiers out the bge1 interface, putting it on the wire. This may not be desirable for security reasons, and it introduces latency since traffic now goes onto the wire. Another option would be to use exclusive IP Instances for the web servers. For each web zone, and this example only has one, that would require two additional data links (NICs). That would add up. Also, this configuration is targeted to be used with Solaris Cluster's scalable services, and those must be in shared IP Instance zones. Hummm....as I like to say.

    We didn't know about the shared IP Instance restriction of Solaris Cluster, and as the customer was considering how they were going to add additional NICs to all the systems, something slowly developed in my mind. How about creating a shared, dummy network between the web and application tier? They had one spare NIC, and with shared IP it does not even need to be connected to a switch port, since IP will loop all traffic back anyway!

    The more I thought about it, the more I liked it, and I could not see anything wrong with it. At least not technically as I understood Solaris. Operationally, for the customer, it might be a little awkward.

    Here is what I was thinking of...

    With this configuration the web1 zone has a default router only to the Internet and it can reach odr, and if necessary, app1 and app2, directly via the new network. And app1 and app2 only have a single default route to get to the Intranet. The nice thing is that bge3 does not even need to be up. That is visible with ifconfig output, where bge3 is not showing a RUNNING flag, which indicates the port is not connected (or in my case has been disabled on the switch).

    global# ifconfig -a4
    ...
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 0.0.0.0 netmask 0
            ether 0:3:ba:e3:42:8c
    bge2: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
            inet 0.0.0.0 netmask 0
            ether 0:3:ba:e3:42:8d 
    bge3: flags=1000802<BROADCAST,MULTICAST,IPv4> mtu 1500 index 5 
            inet 0.0.0.0 netmask 0
            ether 0:3:ba:e3:42:8e
    ...
    
    And within web1 there is now only one default route.
    web1# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- --------- 
    default              172.16.1.1           UG        1         17 bge1 
    172.16.1.0           172.16.1.41          U         1          2 bge1:1
    192.168.52.0         192.168.52.41        U         1          2 bge3:1
    224.0.0.0            172.16.1.41          U         1          0 bge1:1
    127.0.0.1            127.0.0.1            UH        4        120 lo0:1
    
    In the customer's case, multiple systems were being used, so the private networks were connected together so that a web zone on one system could access an odr zone on another. I am showing the simple, single system case since it is so convenient.
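    For completeness, the additional net resource that put web1 onto the dummy 192.168.52.0 network would have been added roughly like this (a sketch; the address matches the routing table shown above):

    global# zonecfg -z web1
    zonecfg:web1> add net
    zonecfg:web1:net> set physical=bge3
    zonecfg:web1:net> set address=192.168.52.41/24
    zonecfg:web1:net> end
    zonecfg:web1> commit
    zonecfg:web1> exit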

    If I were using Solaris Express Community Edition (SX-CE) or OpenSolaris 2009.06 Developer Builds, with the Crossbow bits and virtual NICs (VNICs) available, I wouldn't even have needed to use that physical interface. Both are available here.

    I hope this trick might help others out in the future.

    Steffen

    Tuesday Apr 14, 2009

    Using IPMP with link based failure detection

    Solaris has had a feature to increase network availability called IP Multipathing (IPMP). Initially it required a test address on every data link in an IPMP group, where the test addresses were used as the source IP address to probe network elements for path availability. One of the benefits of probe-based failure detection is that it can extend beyond the directly connected link(s), and verify paths through the attached switch(es) to what typically is a router or other redundant element to provide available services.

    Having one IP address (whether public or private, non-routable) per data link, plus the separate address(es) for the application(s), turns out to be a lot of addresses to allocate and administer. And since the default of five probes spaced two seconds apart means a failure takes at least ten (10) seconds to detect, something more was needed.

    So in the Solaris 9 timeframe the ability to also do link based failure detection was delivered. It requires specific NICs whose driver has the ability to notify the system that a link has failed. The Introduction to IPMP in the Solaris 10 System Administration Guide: IP Services lists the NICs that support link state notification. Solaris 10 supports configuring IPMP with only link based failure detection.

    global# more /etc/hostname.bge[12]
    ::::::::::::::
    /etc/hostname.bge1
    ::::::::::::::
    10.1.14.140/26 group ipmp1 up
    ::::::::::::::
    /etc/hostname.bge2
    ::::::::::::::
    group ipmp1 standby up
    
    On system boot, there will be an indication on the console that since no test addresses are defined, probe-based failure detection is disabled.

    Apr 10 10:57:20 in.mpathd[168]: No test address configured on interface bge2; disabling probe-based failure detection on it
    Apr 10 10:57:20 in.mpathd[168]: No test address configured on interface bge1; disabling probe-based failure detection on it
    
    Looking at the interfaces configured,
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 10.1.14.140 netmask ffffffc0 broadcast 10.1.14.191
            groupname ipmp1
            ether 0:3:ba:e3:42:8c
    bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
    bge2: flags=69000842<BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 0 index 4
            inet 0.0.0.0 netmask 0
            groupname ipmp1
            ether 0:3:ba:e3:42:8d
    
    you will notice that two of the three interfaces have no address (0.0.0.0). Also, the data address is on a physical interface on bge1. At the same time bge2 has the 0.0.0.0 address. On the failure of bge1,
    Apr 10 14:34:53 global bge: NOTICE: bge1: link down
    Apr 10 14:34:53 global in.mpathd[168]: The link has gone down on bge1
    Apr 10 14:34:53 global in.mpathd[168]: NIC failure detected on bge1 of group ipmp1
    Apr 10 14:34:53 global in.mpathd[168]: Successfully failed over from NIC bge1 to NIC bge2
    
    
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=19000802<BROADCAST,MULTICAST,IPv4,NOFAILOVER,FAILED> mtu 0 index 3
            inet 0.0.0.0 netmask 0
            groupname ipmp1
            ether 0:3:ba:e3:42:8c
    bge2: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname ipmp1
            ether 0:3:ba:e3:42:8d
    bge2:1: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
            inet 10.1.14.140 netmask ffffffc0 broadcast 10.1.14.191
    
    the data address is migrated onto bge2:1. I find this a little confusing. However, I don't know any way around it on Solaris 10. The IPMP Re-architecture makes this a lot easier!

    Using Link-based IPMP with non-global zones

    Configuring a shared IP Instance non-global zone and utilizing IPMP managed in the global zone is very easy.

    The IPMP configuration is very simple. Interface bge1 is active, and bge2 is in stand-by mode.

    global# more /etc/hostname.bge[12]
    ::::::::::::::
    /etc/hostname.bge1
    ::::::::::::::
    group ipmp1 up
    ::::::::::::::
    /etc/hostname.bge2
    ::::::::::::::
    group ipmp1 standby up
    
    My zone configuration is:
    global# zonecfg -z zone1 info
    zonename: zone1
    zonepath: /zones/zone1
    brand: native
    autoboot: false
    bootargs:
    pool:
    limitpriv:
    scheduling-class:
    ip-type: shared
    inherit-pkg-dir:
            dir: /lib
    inherit-pkg-dir:
            dir: /platform
    inherit-pkg-dir:
            dir: /sbin
    inherit-pkg-dir:
            dir: /usr
    net:
            address: 10.1.14.141/26
            physical: bge1
    
    Prior to booting, the network configuration is:
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone zone1
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname ipmp1
            ether 0:3:ba:e3:42:8c
    bge2: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname ipmp1
            ether 0:3:ba:e3:42:8d
    
    After booting, the network looks like this:
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone zone1
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname ipmp1
            ether 0:3:ba:e3:42:8c
    bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            zone zone1
            inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
    bge2: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname ipmp1
            ether 0:3:ba:e3:42:8d
    

    So a simple case for the use of IPMP, without the need for test addresses! Other IPMP configurations, such as more than two data links, or active-active, are also supported with link based failure detection. The more links involved, the more test addresses are saved with link based failure detection. Since writing this entry I was involved in a customer configuration where this is saving several hundred IP addresses and their management (such as avoiding duplicate addresses). That customer is willing to forgo the benefit of probes testing past the local switch port.

    Steffen

    Monday Jan 19, 2009

    IPMP Re-architecture is delivered

    In the process of working on some zones and IPMP testing, I ran into a little difficulty. After probing for some insight, I was reminded by Peter Memishian that the IPMP Re-Architecture (part of Project Clearview) bits were going to be in Nevada/SXCE build 107, and that I could BFU the latest bits onto an existing Nevada install. Well!!! [For Peter's own perspective of this, see his recent blog.]

    Since I was already playing with build 105 because the Crossbow features are now integrated, I decided to apply the IPMP bits to a 105 installation. [Note: The IPMP Re-architecture is expected to be in Solaris Express Community Edition (SX-CE) build 107 or so (due out early Feb 2009), and thus in OpenSolaris 2009.spring (I don't know what its final name will be). Early access to IPS packages for OpenSolaris 2008.11 should appear in the bi-weekly developer repository shortly after SX-CE has the feature included. There is no intention to back port the re-architecture to Solaris 10.]

    I am impressed! The bits worked right away, and once I got used to the slightly different way of monitoring IPMP, I really liked what I saw.

    Being accustomed to IPMP on Solaris 10, and having done Crossbow beta testing on previous Nevada bits, I used the long-standing (Solaris 10 and prior) IPMP configuration style. For my testing, I am using link failure testing only, so no probe addresses are configured. [For examples of the new configuration format, see the section Using the New IPMP Configuration Style below. (15 Feb 2009)]

    global# cat /etc/hostname.bge1
    group shared
    
    global# cat /etc/hostname.bge2
    group shared
    
    global# cat /etc/hostname.bge3
    group shared standby
    
    In my test case bge1 and bge2 are active interfaces, and bge3 is a standby interface.
    global# ifconfig -a4
    bge0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
            inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared
            ether 0:3:ba:e3:42:8c
    bge2: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 4
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared
            ether 0:3:ba:e3:42:8d
    bge3: flags=261000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY,INACTIVE,CoS> mtu 1500 index 5
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared
            ether 0:3:ba:e3:42:8e
    ipmp0: flags=8201000842<BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
            inet 0.0.0.0 netmask 0
            groupname shared
    
    You will notice that all three interfaces are up and part of group shared. What is different from the old IPMP is that another interface was created automatically, with the IPMP flag. This is the interface that will be used for all the data IP addresses.

    Because I used the old format for the /etc/hostname.\* files, the backward compatibility of the new IPMP automatically created the ipmp0 interface and assigned it a name. If I wish to have control over that name, I must configure IPMP slightly differently. More on that later.

    The new command ipmpstat(1M) is also introduced to get enhanced information regarding the IPMP configuration.

    My test is really about using zones and IPMP, so here is what things look like when I bring up three zones that are also configured the traditional way, with network definitions using the bge interfaces. [Using the new format, I would replace bge with either ipmp0 (keep in mind that 0 (zero) is set dynamically) or shared. For more details on the new format, go to Using the New IPMP Configuration Style below. (15 Feb 2009)]

    global# for i in 1 2 3; do zonecfg -z shared${i} info net; done
    net:
            address: 10.1.14.141/26
            physical: bge1
            defrouter: 10.1.14.129
    net:
            address: 10.1.14.142/26
            physical: bge1
            defrouter: 10.1.14.129
    net:
            address: 10.1.14.143/26
            physical: bge2
            defrouter: 10.1.14.129
    
    After booting the zones, note that the zones' IP addresses are on logical interfaces on ipmp0, not the previous way of being logical interfaces on bge.
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared1
            inet 127.0.0.1 netmask ff000000
    lo0:2: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared2
            inet 127.0.0.1 netmask ff000000
    lo0:3: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared3
            inet 127.0.0.1 netmask ff000000
    bge0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
            inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared
            ether 0:3:ba:e3:42:8c
    bge2: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 4
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared
            ether 0:3:ba:e3:42:8d
    bge3: flags=261000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY,INACTIVE,CoS> mtu 1500 index 5
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared
            ether 0:3:ba:e3:42:8e
    ipmp0: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
            zone shared1
            inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
            groupname shared
    ipmp0:1: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
            zone shared2
            inet 10.1.14.142 netmask ffffffc0 broadcast 10.1.14.191
    ipmp0:2: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
            zone shared3
            inet 10.1.14.143 netmask ffffffc0 broadcast 10.1.14.191
    
    For address information, here are the pre and post boot ipmpstat outputs.
    global# ipmpstat -a
    ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
    0.0.0.0                   down   ipmp0       --          --
    
    global# ipmpstat -a
    ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
    10.1.14.143               up     ipmp0       bge1        bge2 bge1
    10.1.14.142               up     ipmp0       bge2        bge2 bge1
    10.1.14.141               up     ipmp0       bge1        bge2 bge1
    
    What's really neat is that it shows which interface(s) are used for outbound traffic. A different interface will be selected for each new remote IP address. That is the level of outbound load spreading at this time.
    global# ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT       INTERFACES
    ipmp0       shared      ok        --        bge2 bge1 (bge3)
    
    There is no group difference before or after.
    global# ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT       INTERFACES
    ipmp0       shared      ok        --        bge2 bge1 (bge3)
    
    The FDT column lists the probe-based failure detection time, and is empty since that is disabled in this setup. bge3 is listed third and in parentheses since that interface is not being used for data traffic at this time.
    global# ipmpstat -i
    INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
    bge3        no      ipmp0       is-----   up        disabled  ok
    bge2        yes     ipmp0       -------   up        disabled  ok
    bge1        yes     ipmp0       --mb---   up        disabled  ok
    
    Also, there are no differences in interface status. In both cases bge1 is used for multicast and broadcast traffic, and bge3 is inactive and in standby mode.
    global# ipmpstat -i
    INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
    bge3        no      ipmp0       is-----   up        disabled  ok
    bge2        yes     ipmp0       -------   up        disabled  ok
    bge1        yes     ipmp0       --mb---   up        disabled  ok
    
    The probe and target output is uninteresting in this setup as I don't have probe based failure detection on. I am including them for completeness.
    global# ipmpstat -p
    ipmpstat: probe-based failure detection is disabled
    
    global# ipmpstat -t
    INTERFACE   MODE      TESTADDR            TARGETS
    bge3        disabled  --                  --
    bge2        disabled  --                  --
    bge1        disabled  --                  --
    
    So let's see what happens on a link 'failure' as I turn off the switch port going to bge1.

    On the console, the indication is a link failure.

    Jan 15 14:49:07 global in.mpathd[210]: The link has gone down on bge1
    Jan 15 14:49:07 global in.mpathd[210]: IP interface failure detected on bge1 of group shared
    
    The various ipmpstat outputs reflect the failure of bge1 and the failover to bge3, which had been in standby mode, and to bge2. I had expected both IP addresses to end up on bge3. Instead, IPMP determines how to best spread the IPs across the available interfaces.

    The address output shows that .141 and .143 are now on bge3.

    global# ipmpstat -a
    ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
    10.1.14.143               up     ipmp0       bge3        bge3 bge2
    10.1.14.142               up     ipmp0       bge2        bge3 bge2
    10.1.14.141               up     ipmp0       bge2        bge3 bge2
    
    The group status has changed, with bge1 now shown in brackets as it is in failed mode.
    global# ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT       INTERFACES
    ipmp0       shared      degraded  --        bge3 bge2 [bge1]
    
    The interface status makes it clear that bge1 is down. Broadcast and multicast are now handled by bge2.
    global# ipmpstat -i
    INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
    bge3        yes     ipmp0       -s-----   up        disabled  ok
    bge2        yes     ipmp0       --mb---   up        disabled  ok
    bge1        no      ipmp0       -------   down      disabled  failed
    
    As expected, the only difference in the ifconfig output is for bge1, showing that it is in a failed state. The zones continue to be shown on the ipmp0 interface. This took a little getting used to. Before, ifconfig was sufficient to see the full state. Now, I must use ipmpstat as well.

    global# ifconfig -a4
    ...
    bge1: flags=211000803<UP,BROADCAST,MULTICAST,IPv4,FAILED,CoS> mtu 1500 index 3
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared
            ether 0:3:ba:e3:42:8c
    bge2: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 4
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared
            ether 0:3:ba:e3:42:8d
    bge3: flags=221000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY,CoS> mtu 1500 index 5
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared
            ether 0:3:ba:e3:42:8e
    ipmp0: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
            zone shared1
            inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
            groupname shared
    ipmp0:1: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
            zone shared2
            inet 10.1.14.142 netmask ffffffc0 broadcast 10.1.14.191
    ipmp0:2: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
            zone shared3
            inet 10.1.14.143 netmask ffffffc0 broadcast 10.1.14.191
    
    "Repairing" the interface, things return to normal.
    Jan 15 15:13:03 global in.mpathd[210]: The link has come up on bge1
    Jan 15 15:13:03 global in.mpathd[210]: IP interface repair detected on bge1 of group shared
    
    Note that here only one IP address ended up being moved back to bge1.
    global# ipmpstat -a
    ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
    10.1.14.143               up     ipmp0       bge1        bge2 bge1
    10.1.14.142               up     ipmp0       bge2        bge2 bge1
    10.1.14.141               up     ipmp0       bge2        bge2 bge1
    
    Interface bge3 is back in standby mode.
    global# ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT       INTERFACES
    ipmp0       shared      ok        --        bge2 bge1 (bge3)
    
    All three interfaces are up, only two are active, and broadcast and multicast stayed on bge2 (no need to change that now).
    global# ipmpstat -i
    INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
    bge3        no      ipmp0       is-----   up        disabled  ok
    bge2        yes     ipmp0       --mb---   up        disabled  ok
    bge1        yes     ipmp0       -------   up        disabled  ok
    
    As a further example of the rebalancing of IP addresses, here is what happens with four IP addresses spread across two interfaces.
    global# ipmpstat -a
    ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
    10.1.14.144               up     ipmp0       bge2        bge2 bge1
    10.1.14.143               up     ipmp0       bge1        bge2 bge1
    10.1.14.142               up     ipmp0       bge2        bge2 bge1
    10.1.14.141               up     ipmp0       bge1        bge2 bge1
    
    Jan 15 16:19:09 global in.mpathd[210]: The link has gone down on bge1
    Jan 15 16:19:09 global in.mpathd[210]: IP interface failure detected on bge1 of group shared
    
    global# ipmpstat -a
    ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
    10.1.14.144               up     ipmp0       bge2        bge3 bge2
    10.1.14.143               up     ipmp0       bge3        bge3 bge2
    10.1.14.142               up     ipmp0       bge2        bge3 bge2
    10.1.14.141               up     ipmp0       bge3        bge3 bge2
    
    Jan 15 18:11:35 global in.mpathd[210]: The link has come up on bge1
    Jan 15 18:11:35 global in.mpathd[210]: IP interface repair detected on bge1 of group shared
    
    global# ipmpstat -a
    ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
    10.1.14.144               up     ipmp0       bge2        bge2 bge1
    10.1.14.143               up     ipmp0       bge1        bge2 bge1
    10.1.14.142               up     ipmp0       bge2        bge2 bge1
    10.1.14.141               up     ipmp0       bge1        bge2 bge1
    
    There is even spreading of the IP addresses across any two active interfaces.

    Using the New IPMP Configuration Style

    In the previous examples, I used the old style of configuring IPMP with the /etc/hostname.xyzN files. Those files should work on all older versions of Solaris as well as with the re-architecture bits. This section briefly covers the new format.

    A new file that is introduced is the hostname.ipmp-group configuration file. It must follow the same format as any other data link configuration, ASCII characters followed by a number. I will use the same group name as above; however, I have to add a number to the end--thus the group name will be shared0. If you don't have the trailing number, the old style of IPMP setup will be used.

    I create a file to define the IPMP group. Note that it contains only the keyword ipmp.

    global# cat /etc/hostname.shared0
    ipmp
    
    The other files for the NICs reference the IPMP group name.

    global# cat /etc/hostname.bge1
    group shared0 up
    
    global# cat /etc/hostname.bge2
    group shared0 up
    
    global# cat /etc/hostname.bge3
    group shared0 standby up
    
    One note that may not be obvious. I am not using the keyword -failover as I am not using test addresses. Thus the interfaces are also not listed as deprecated in the ifconfig output.

    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    shared0: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared0
    bge0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
            inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 4
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared0
            ether 0:3:ba:e3:42:8c
    bge2: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 5
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared0
            ether 0:3:ba:e3:42:8d
    bge3: flags=261000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY,INACTIVE,CoS> mtu 1500 index 6
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared0
            ether 0:3:ba:e3:42:8e
    
    After booting the zones, which are still configured to use bge1 or bge2, things look like this.
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared1
            inet 127.0.0.1 netmask ff000000
    lo0:2: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared2
            inet 127.0.0.1 netmask ff000000
    lo0:3: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared3
            inet 127.0.0.1 netmask ff000000
    shared0: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared0
    shared0:1: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
            zone shared1
            inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
    shared0:2: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
            zone shared2
            inet 10.1.14.142 netmask ffffffc0 broadcast 10.1.14.191
    shared0:3: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
            zone shared3
            inet 10.1.14.143 netmask ffffffc0 broadcast 10.1.14.191
    bge0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
            inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 4
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared0
            ether 0:3:ba:e3:42:8c
    bge2: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 5
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared0
            ether 0:3:ba:e3:42:8d
    bge3: flags=261000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY,INACTIVE,CoS> mtu 1500 index 6
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared0
            ether 0:3:ba:e3:42:8e
    
    global# ipmpstat -a
    ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
    10.1.14.143               up     shared0     bge1        bge2 bge1
    10.1.14.142               up     shared0     bge2        bge2 bge1
    10.1.14.141               up     shared0     bge1        bge2 bge1
    0.0.0.0                   up     shared0     --          --
    
    global# ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT       INTERFACES
    shared0     shared0     ok        --        bge2 bge1 (bge3)
    
    global# ipmpstat -i
    INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
    bge3        no      shared0     is-----   up        disabled  ok
    bge2        yes     shared0     -------   up        disabled  ok
    bge1        yes     shared0     --mb---   up        disabled  ok
    
    Things are the same as before, except that I have now specified the IPMP group name (shared0 instead of the previous ipmp0). I find this very useful, as the name can help identify the purpose, and when debugging, IPMP group names with context-appropriate text should be very helpful.

    I find the integration, or rather the backward compatibility, great. Not only will the old or existing IPMP setup work, the existing zonecfg network setup works as well. This means the same configuration files will work pre- and post-re-architecture!

    Let's take a look at how things look within a zone.

    shared1# ifconfig -a4
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    shared0:1: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
            inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
    
    shared1# netstat -rnf inet
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              10.1.14.129          UG        1          2 shared0
    10.1.14.128          10.1.14.141          U         1          0 shared0:1
    127.0.0.1            127.0.0.1            UH        1         33 lo0:1
    
    The zone's network is on the link shared0 using a logical IP, and everything else looks as it has always looked. This output is actually while bge1 is down. IPMP hides all the details in the non-global zone.

    Using Probe-based Failover

    The configurations so far have been with link-based failure detection. IPMP also has the ability to do probe-based failure detection, where ICMP packets are sent to other nodes on the link. This allows failure detection well beyond what link-based detection can do, covering the whole switch and elements past it, up to and including routers. In order to use probe-based failure detection, test addresses are required on the physical NICs. For my configuration, I use test addresses on a completely different subnet, and my router is another system running Solaris 10. The router happens to be a zone with two NICs, configured as an exclusive IP Instance.

    I am using a completely different subnet because I want to isolate the global zone from the non-global zones; the setup also uses the defrouter zonecfg option, and I don't want to interfere with that.

    The IPMP setup is as follows. I have added test addresses on the 172.16.10.0/24 subnet, and the interfaces are set to not fail over.

    global# cat /etc/hostname.shared0
    ipmp
    
    global# cat /etc/hostname.bge1
    172.16.10.141/24 group shared0 -failover up
    
    global# cat /etc/hostname.bge2
    172.16.10.142/24 group shared0 -failover up
    
    global# cat /etc/hostname.bge3
    172.16.10.143/24 group shared0 -failover standby up
    
    This is the state of the system before bringing up any zones.
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    shared0: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname shared0
    bge0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
            inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=209040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,CoS> mtu 1500 index 4
            inet 172.16.10.141 netmask ffffff00 broadcast 172.16.10.255
            groupname shared0
            ether 0:3:ba:e3:42:8c
    bge2: flags=209040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,CoS> mtu 1500 index 5
            inet 172.16.10.142 netmask ffffff00 broadcast 172.16.10.255
            groupname shared0
            ether 0:3:ba:e3:42:8d
    bge3: flags=269040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE,CoS> mtu 1500 index 6
            inet 172.16.10.143 netmask ffffff00 broadcast 172.16.10.255
            groupname shared0
            ether 0:3:ba:e3:42:8e
    
    The ipmpstat output is different now.
    global# ipmpstat -a
    ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
    0.0.0.0                   up     shared0     --          --
    
    global# ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT       INTERFACES
    shared0     shared0     ok        10.00s    bge2 bge1 (bge3)
    
    global# ipmpstat -i
    INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
    bge3        no      shared0     is-----   up        ok        ok
    bge2        yes     shared0     -------   up        ok        ok
    bge1        yes     shared0     --mb---   up        ok        ok
    
    The Failure Detection Time is now set. And the probe information option shows an ongoing stream of probe results.
    global# ipmpstat -p
    TIME      INTERFACE   PROBE  NETRTT    RTT       RTTAVG    TARGET
    0.14s     bge3        426    0.48ms    0.56ms    0.68ms    172.16.10.16
    0.24s     bge2        426    0.50ms    0.98ms    0.74ms    172.16.10.16
    0.26s     bge1        424    0.42ms    0.71ms    1.72ms    172.16.10.16
    1.38s     bge1        425    0.42ms    0.50ms    1.57ms    172.16.10.16
    1.79s     bge2        427    0.54ms    0.86ms    0.76ms    172.16.10.16
    1.93s     bge3        427    0.45ms    0.53ms    0.66ms    172.16.10.16
    2.79s     bge1        426    0.38ms    0.56ms    1.44ms    172.16.10.16
    2.85s     bge2        428    0.34ms    0.41ms    0.71ms    172.16.10.16
    3.15s     bge3        428    0.44ms    4.55ms    1.14ms    172.16.10.16
    ^C
    
    The target information option shows the current probe targets.
    global# ipmpstat -t
    INTERFACE   MODE      TESTADDR            TARGETS
    bge3        multicast 172.16.10.143       172.16.10.16
    bge2        multicast 172.16.10.142       172.16.10.16
    bge1        multicast 172.16.10.141       172.16.10.16
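    Incidentally, the 10 seconds shown in the FDT column is the in.mpathd default, and it can be tuned in /etc/default/mpathd (the value is in milliseconds); a quick check looks like this:

    global# grep FAILURE_DETECTION_TIME /etc/default/mpathd
    FAILURE_DETECTION_TIME=10000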
    
    Once the zones are up and running and bge1 is down, the status output changes accordingly.
    global# ipmpstat -a
    ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
    10.1.14.143               up     shared0     bge2        bge3 bge2
    10.1.14.142               up     shared0     bge3        bge3 bge2
    10.1.14.141               up     shared0     bge2        bge3 bge2
    0.0.0.0                   up     shared0     --          --
    
    global# ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT       INTERFACES
    shared0     shared0     degraded  10.00s    bge3 bge2 [bge1]
    
    global# ipmpstat -i
    INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
    bge3        yes     shared0     -s-----   up        ok        ok
    bge2        yes     shared0     --mb---   up        ok        ok
    bge1        no      shared0     -------   down      failed    failed
    
    global# ipmpstat -p
    TIME      INTERFACE   PROBE  NETRTT    RTT       RTTAVG    TARGET
    0.46s     bge2        839    0.43ms    0.98ms    1.17ms    172.16.10.16
    1.15s     bge3        840    0.32ms    0.37ms    0.65ms    172.16.10.16
    1.48s     bge2        840    0.37ms    0.45ms    1.08ms    172.16.10.16
    2.56s     bge3        841    0.45ms    0.54ms    0.63ms    172.16.10.16
    3.17s     bge2        841    0.40ms    0.51ms    1.01ms    172.16.10.16
    3.93s     bge3        842    0.40ms    0.47ms    0.61ms    172.16.10.16
    4.61s     bge2        842    0.63ms    0.75ms    0.98ms    172.16.10.16
    5.17s     bge3        843    0.38ms    0.46ms    0.59ms    172.16.10.16
    5.72s     bge2        843    0.36ms    0.44ms    0.91ms    172.16.10.16
    ^C
    
    global# ipmpstat -t
    INTERFACE   MODE      TESTADDR            TARGETS
    bge3        multicast 172.16.10.143       172.16.10.16
    bge2        multicast 172.16.10.142       172.16.10.16
    bge1        multicast 172.16.10.141       172.16.10.16
    
    Without showing the details here, the non-global zones continue to function.

    Bringing all three interfaces down, things look like this.

    Jan 19 13:51:22 global in.mpathd[61]: The link has gone down on bge2
    Jan 19 13:51:22 global in.mpathd[61]: IP interface failure detected on bge2 of group shared0
    Jan 19 13:52:04 global in.mpathd[61]: The link has gone down on bge3
    Jan 19 13:52:04 global in.mpathd[61]: All IP interfaces in group shared0 are now unusable
    
    global# ipmpstat -a
    ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
    10.1.14.143               up     shared0     --          --
    10.1.14.142               up     shared0     --          --
    10.1.14.141               up     shared0     --          --
    0.0.0.0                   up     shared0     --          --
    
    global# ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT       INTERFACES
    shared0     shared0     failed    10.00s    [bge3 bge2 bge1]
    
    global# ipmpstat -i
    INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
    bge3        no      shared0     -s-----   down      failed    failed
    bge2        no      shared0     -------   down      failed    failed
    bge1        no      shared0     -------   down      failed    failed
    
    global# ipmpstat -p
    ^C
    
    global# ipmpstat -t
    INTERFACE   MODE      TESTADDR            TARGETS
    bge3        multicast 172.16.10.143       --
    bge2        multicast 172.16.10.142       --
    bge1        multicast 172.16.10.141       --
    
    The whole IPMP group shared0 is down, all of the relevant ipmpstat output reflects that, no probe targets are listed, and no probe RTT reports are generated.

    An additional scenario might be to have two separate paths, and have something other than a link failure force the failover.

    Tuesday Jan 13, 2009

    Using zonecfg defrouter with shared-IP zones

    [Update to IPMP testing 2009.01.20]

    [Minor update 2009.01.14]

    When running Solaris Zones in a shared-IP configuration, all network settings are determined either by how the zone is configured using zonecfg(1M) or by what the global zone's IP stack decides (such as routes). This has caused some trouble in situations where zones are on different subnets, especially if the global zone is not on the subnet(s) the non-global zones are on. Exclusive IP Instances were delivered to help address these cases, but they require a data link per zone, and when running a large number of zones there may not be enough data links available.

    With Solaris 10 10/08 (Update 6), an additional network configuration parameter is available for shared-IP zones. This is the default router (defrouter) optional parameter.

    Using the defrouter parameter, it is possible to set which router to use for traffic leaving the zone. In the global zone, default router entries are created the first time the zone is booted. Note that the entries are not deleted when the zone is halted.
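
    As a sketch of how this is set up, the property can be added to an existing net resource with zonecfg(1M). The zone name and the addresses below are the ones used later in this entry, and selecting the net resource by its address is an assumption based on that configuration.
    global# zonecfg -z shared1
    zonecfg:shared1> select net address=10.1.14.141/26
    zonecfg:shared1:net> set defrouter=10.1.14.129
    zonecfg:shared1:net> end
    zonecfg:shared1> commit
    zonecfg:shared1> exit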

    The net resource looks like this for a zone with defrouter configured.

    global# zonecfg -z shared1 info net
    net:
            address: 10.1.14.141/26
            physical: bge1
            defrouter: 10.1.14.129
    
    And it looks like this if it is not set.
    global# zonecfg -z shared1 info net
    net:
            address: 10.1.14.141/26
            physical: bge1
            defrouter not specified
    
    So I have run a variety of configurations, and some things I observed are as follows. (Most of the configurations used a separate interface for the global zone (bge0) from the ones used by the non-global zones (bge1 and bge2). IPMP is not used in these configurations; a comment on that at the end.) The [#] markers indicate examples in the outputs that follow.
    • A default route entry is created for the NIC [1] on which the zone is configured when the zone is booted. [2]
    • Entries are not deleted when a zone is halted. They persist until manually removed [3] or until the global zone is rebooted.
    • It is possible to have the same default router configured for multiple zones. [4]
    • It is possible to have the same default router listed on multiple interfaces. * [5]
    • It is possible to have multiple default routers on the same interface, even on different IP subnets. [6]
    • The interface used for outbound traffic is the one the zone is assigned to. [7]
    • It is sufficient to plumb the interface for the non-global zones in the global zone (thus it has 0.0.0.0 as its IP address in the global zone). [8]
    • The physical interface can be down in the global zone. [9]
    • If only one interface is used, and different subnets for the global and non-global zones are configured, routing works when setting defrouter [10] and does not work if it is not set.
    The most interesting thing I noticed was that although two non-global zones may be on the same IP subnet, if they are configured on different interfaces, their traffic leaves the system on the interface each zone is configured on. This is typically not the case when using shared IP and also having an IP address for that subnet in the global zone.

    * Note: Having two interfaces on the same IP subnet without configuring IP Multipathing (IPMP) may not be a supported configuration. I am looking for documentation that states this one way or another. [2009.01.14]

    Examples

    1. Single Zone, Single Interface--The Basics

    Create a single non-global zone.
    global# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              139.164.63.215       UG        1          2 bge0
    139.164.63.0         139.164.63.125       U         1          1 bge0
    224.0.0.0            139.164.63.125       U         1          0 bge0
    127.0.0.1            127.0.0.1            UH        1         42 lo0
    
    global# zonecfg -z shared1 info net
    net:
            address: 10.1.14.141/26
            physical: bge1
            defrouter: 10.1.14.129
    
    global# zoneadm -z shared1 boot [2]
    
    global# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              139.164.63.215       UG        1          2 bge0
    default              10.1.14.129          UG        1          0 bge1 [1]
    139.164.63.0         139.164.63.125       U         1          1 bge0
    224.0.0.0            139.164.63.125       U         1          0 bge0
    127.0.0.1            127.0.0.1            UH        1         42 lo0
    
    global# zoneadm -z shared1 halt
    
    global# zoneadm list -v
      ID NAME             STATUS     PATH                           BRAND    IP
       0 global           running    /                              native   shared
    
    global# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              10.1.14.129          UG        1          0 bge1
    default              139.164.63.215       UG        1          1 bge0
    139.164.63.0         139.164.63.125       U         1          1 bge0
    224.0.0.0            139.164.63.125       U         1          0 bge0
    127.0.0.1            127.0.0.1            UH        1         42 lo0
    
    global# route delete default 10.1.14.129 [3]
    delete net default: gateway 10.1.14.129
    
    global# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              139.164.63.215       UG        1          1 bge0
    139.164.63.0         139.164.63.125       U         1          1 bge0
    224.0.0.0            139.164.63.125       U         1          0 bge0
    127.0.0.1            127.0.0.1            UH        1         42 lo0
    

    2. Multiple Interfaces, Same Default Router

    Three zones, where two use bge1 and the third uses bge2. All use the same default router.
    global# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              139.164.63.215       UG        1          1 bge0
    139.164.63.0         139.164.63.125       U         1          1 bge0
    224.0.0.0            139.164.63.125       U         1          0 bge0
    127.0.0.1            127.0.0.1            UH        1         42 lo0
    
    global# zonecfg -z shared1 info net
    net:
            address: 10.1.14.141/26
            physical: bge1
            defrouter: 10.1.14.129 [4]
    
    global# zonecfg -z shared2 info net
    net:
            address: 10.1.14.142/26
            physical: bge1
            defrouter: 10.1.14.129 [4]
    
    global# zonecfg -z shared3 info net
    net:
            address: 10.1.14.143/26
            physical: bge2
            defrouter: 10.1.14.129 [5]
    
    global# zoneadm -z shared1 boot
    
    global# zoneadm -z shared2 boot
    
    global# zoneadm -z shared3 boot
    
    global# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              10.1.14.129          UG        1          0 bge1 [4]
    default              139.164.63.215       UG        1          1 bge0
    default              10.1.14.129          UG        1          2 bge2 [5]
    139.164.63.0         139.164.63.125       U         1          1 bge0
    224.0.0.0            139.164.63.125       U         1          0 bge0
    127.0.0.1            127.0.0.1            UH        1         42 lo0
    
    global# zoneadm list -v
      ID NAME             STATUS     PATH                           BRAND    IP
       0 global           running    /                              native   shared
       3 shared1          running    /zones/shared1                 native   shared
       4 shared2          running    /zones/shared2                 native   shared
       5 shared3          running    /zones/shared3                 native   shared
    
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared1
            inet 127.0.0.1 netmask ff000000
    lo0:2: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared2
            inet 127.0.0.1 netmask ff000000
    lo0:3: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared3
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 0.0.0.0 netmask 0
            ether 0:3:ba:e3:42:8c
    bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            zone shared1
            inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
    bge1:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            zone shared2
            inet 10.1.14.142 netmask ffffffc0 broadcast 10.1.14.191
    bge2: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
            inet 0.0.0.0 netmask 0
            ether 0:3:ba:e3:42:8d
    bge2:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
            zone shared3
            inet 10.1.14.143 netmask ffffffc0 broadcast 10.1.14.191
    

    3. Multiple Subnets

    Add another zone, using bge2 and on a different subnet.
    global# zonecfg -z shared4 info net
    net:
            address: 192.168.16.144/24
            physical: bge2
            defrouter: 192.168.16.129
    
    global# zoneadm -z shared4 boot
    
    global# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              10.1.14.129          UG        1          0 bge1
    default              10.1.14.129          UG        1          4 bge2
    default              139.164.63.215       UG        1          3 bge0
    default              192.168.16.129       UG        1          0 bge2 [6]
    139.164.63.0         139.164.63.125       U         1          4 bge0
    224.0.0.0            139.164.63.125       U         1          0 bge0
    127.0.0.1
    

    4. Interface Usage

    Issue some pings within the non-global zones and see which network interfaces are used. From the global zone, I use zlogin to issue a ping from each zone to a remote system on the same network as the global zone (139.164.63.0), and observe which interfaces carry the traffic. [7]
    global# zlogin shared1 ping 139.164.63.38
    139.164.63.38 is alive
    
    global# zlogin shared2 ping 139.164.63.38
    139.164.63.38 is alive
    
    global# zlogin shared3 ping 139.164.63.38
    139.164.63.38 is alive
    
    global# zlogin shared4 ping 139.164.63.38
    139.164.63.38 is alive
    
    This shows the pings originating from shared1 and shared2 going out on bge1.
    global1# snoop -d bge1 icmp
    Using device /dev/bge1 (promiscuous mode)
     10.1.14.141 -> 139.164.63.38 ICMP Echo request (ID: 4677 Sequence number: 0)
    139.164.63.38 -> 10.1.14.141  ICMP Echo reply (ID: 4677 Sequence number: 0)
     10.1.14.142 -> 139.164.63.38 ICMP Echo request (ID: 4681 Sequence number: 0)
    139.164.63.38 -> 10.1.14.142  ICMP Echo reply (ID: 4681 Sequence number: 0)
    
    And this shows the pings originating from shared3 and shared4 going out on bge2.
    global2# snoop -d bge2 icmp
    Using device /dev/bge2 (promiscuous mode)
     10.1.14.143 -> 139.164.63.38 ICMP Echo request (ID: 4685 Sequence number: 0)
    139.164.63.38 -> 10.1.14.143  ICMP Echo reply (ID: 4685 Sequence number: 0)
    192.168.16.144 -> 139.164.63.38 ICMP Echo request (ID: 4689 Sequence number: 0)
    139.164.63.38 -> 192.168.16.144 ICMP Echo reply (ID: 4689 Sequence number: 0)
    
    Just to confirm where each zone is configured, here is the ifconfig output.
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared1
            inet 127.0.0.1 netmask ff000000
    lo0:2: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared2
            inet 127.0.0.1 netmask ff000000
    lo0:3: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared3
            inet 127.0.0.1 netmask ff000000
    lo0:4: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared4
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3 [9]
            inet 0.0.0.0 netmask 0 [8]
            ether 0:3:ba:e3:42:8c
    bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            zone shared1
            inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
    bge1:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            zone shared2
            inet 10.1.14.142 netmask ffffffc0 broadcast 10.1.14.191
    bge2: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
            inet 0.0.0.0 netmask 0 [8]
            ether 0:3:ba:e3:42:8d
    bge2:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
            zone shared3
            inet 10.1.14.143 netmask ffffffc0 broadcast 10.1.14.191
    bge2:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
            zone shared4
            inet 192.168.16.144 netmask ffffff00 broadcast 192.168.16.255
    

    5. Using a Single Interface

    Using only bge0, with different subnets for the global and non-global zones. [10]

    Before booting the zone.

    global# netstat -nr
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              139.164.63.215       UG        1          2 bge0
    139.164.63.0         139.164.63.125       U         1          2 bge0
    224.0.0.0            139.164.63.125       U         1          0 bge0
    127.0.0.1            127.0.0.1            UH        1         42 lo0
    
    global# zonecfg -z shared17 info net
    net:
            address: 192.168.17.147/24
            physical: bge0
            defrouter: 192.168.17.16
    
    global# zoneadm -z shared17 boot
    
    Once the zone is booted, netstat shows both default routes, and a ping from the zone works.
    global# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              139.164.63.215       UG        1          2 bge0
    default              192.168.17.16        UG        1          0 bge0
    139.164.63.0         139.164.63.125       U         1          2 bge0
    224.0.0.0            139.164.63.125       U         1          0 bge0
    127.0.0.1            127.0.0.1            UH        1         42 lo0
    
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone shared17
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
            ether 0:3:ba:e3:42:8b
    bge0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            zone shared17
            inet 192.168.17.147 netmask ffffff00 broadcast 192.168.17.255
    
    global# zlogin shared17 ping 139.164.63.38
    139.164.63.38 is alive
    

    IP Multipathing (IPMP)

    I did some testing with IPMP, using examples similar to those above. At this time the combination of IPMP and the defrouter configuration does not work. I have filed bug 6792116 to have this looked at.

    [Updated 2009.01.20] After some additional testing, especially with test addresses and probe-based failure detection, I have seen IPMP work well only when zones are configured such that at least one zone is on each NIC in an IPMP group, including a standby NIC. For example, if you have two NICs, bge1 and bge2, at least one zone must be configured on bge1 and at least one on bge2. This is even the case when one of the NICs is in failed mode when the system or zone(s) boot. It turns out that the default route is added when the zone boots, and there is no later check of default route requirements as a zone is moved from one NIC to another by an IPMP failover or failback. Thus, I would recommend not using defrouter and IPMP together until the combination is confirmed to work.

    If this is important for your deployments, please add a service record to change request 6792116 and work with your service provider to have this addressed. Please also note that this works well with the IPMP Re-architecture coming soon to OpenSolaris.

    Thursday Jan 08, 2009

    Crossbow is delivered--Traveling VNICs and more

    With Solaris Express Community Edition build 105, the initial implementation of Network Virtualization and Resource Control, known as Project Crossbow, is delivered into the main networking code base and available in the distributed images. No need to install additional software! The multi-year effort has reached a major milestone.

    The feature I have been waiting for the most is the virtual NICs (VNICs). This allows me to create multiple data links using a single physical network interface, such as on my laptop. Each data link can be assigned to a different zone, and with exclusive IP Instance zones, each zone can have separate IP management and characteristics. The most useful one for me is to have one zone working on the native local network, and another zone with IPsec enabled, for a VPN connection.

    Previously, I have demonstrated how to do this with two NICs and with one NIC and VNICs. I also have an example of how to achieve this with VLANs.

    Now that Crossbow is integrated, things are much simpler!

    Some Specifics

    The first thing I did was create a VNIC. Note that the dladm(1M) commands have changed slightly, both in general and for VNICs. To see what physical NICs are available, use the show-phys sub-command (it used to be show-dev); on my laptop it looks like this.
    global# dladm show-phys
    LINK         MEDIA                STATE      SPEED  DUPLEX    DEVICE
    ath0         WiFi                 down       0      unknown   ath0
    bge0         Ethernet             up         1000   full      bge0
    
    Data links are the entities that can be assigned to a zone, so let's see those.
    global# dladm show-link
    LINK        CLASS    MTU    STATE    OVER
    ath0        phys     1500   down     --
    bge0        phys     1500   up       --
    
    Now I create a VNIC.
    global# dladm create-vnic -l bge0 vpn0
    
    global# dladm show-link
    LINK        CLASS    MTU    STATE    OVER
    ath0        phys     1500   down     --
    bge0        phys     1500   up       --
    vpn0        vnic     1500   up       bge0
    
    I used the basic create-vnic format, specifying only the device over which to create the VNIC. I let Solaris determine the MAC address, and I did not assign any other properties to the VNIC. The name of a data link must start with letters and end with a number, so I chose vpn0 to make it clear what I want to use it for. I could have called it vpn123456789, showing that the number part can be quite large.
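
    A VNIC can also be created with an explicit MAC address or with link properties such as a bandwidth cap. The following is only a sketch: vnic1 and the locally administered MAC address are made up, and the -m and -p options and the maxbw property are taken from my reading of the dladm(1M) man page.
    global# dladm create-vnic -l bge0 -m 2:8:20:11:22:33 -p maxbw=100M vnic1
    global# dladm show-vnic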

    I now create a zone, and I chose the following configuration.

    global# zonecfg -z vpn info
    zonename: vpn
    zonepath: /zones/vpn
    brand: native
    autoboot: false
    bootargs:
    pool:
    limitpriv:
    scheduling-class:
    ip-type: exclusive
    inherit-pkg-dir:
            dir: /lib
    inherit-pkg-dir:
            dir: /platform
    inherit-pkg-dir:
            dir: /sbin
    inherit-pkg-dir:
            dir: /usr
    net:
            address not specified
            physical: vpn0
            defrouter not specified
    
    The key items are the ip-type setting and the net resource. The zone is an exclusive IP Instance zone, and I only assigned the vpn0 data link to it. The zone is a sparse zone, and an extra inherit-pkg-dir entry is no longer required for IPsec to work (I was curious whether this had been fixed).
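
    Creating a configuration like this from scratch would look roughly like the following sketch. The default create template supplies the sparse-zone inherit-pkg-dir entries shown above, and the zonepath is the one from my configuration.
    global# zonecfg -z vpn
    vpn: No such zone configured
    Use 'create' to begin configuring a new zone.
    zonecfg:vpn> create
    zonecfg:vpn> set zonepath=/zones/vpn
    zonecfg:vpn> set ip-type=exclusive
    zonecfg:vpn> add net
    zonecfg:vpn:net> set physical=vpn0
    zonecfg:vpn:net> end
    zonecfg:vpn> commit
    zonecfg:vpn> exit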

    After installing the zone (I made a clone of an existing zone) and before booting it, I copied a customized sysidcfg file into the zone.

    global# cat /zones/vpn/root/etc/sysidcfg
    system_locale=C
    terminal=xterm
    network_interface=PRIMARY {
            dhcp
            protocol_ipv6=no
    }
    nfs4_domain=dynamic
    security_policy=NONE
    name_service=NONE
    timezone=US/Eastern
    service_profile=limited_net
    timeserver=localhost
    root_password=YyDStVVvtZX6.
    
    Upon booting, the zone gets an IP address via DHCP. This will be useful for being on a variety of networks. When using wireless, I won't have to change the zone's configuration. I will, however, have to recreate vpn0 on top of ath0.
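
    A sketch of what that switch would involve, assuming the zone is halted first since the data link cannot be deleted while it is assigned to a running zone:
    global# zoneadm -z vpn halt
    global# dladm delete-vnic vpn0
    global# dladm create-vnic -l ath0 vpn0
    global# zoneadm -z vpn boot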

    Now I can happily be on a public network and the corporate network at the same time. In this example I run the VPN within the non-global zone. However, depending on my needs at the moment, I could have the global zone be the one VPNed in, and the non-global zone be on the public network. It is just a matter of where I run the VPN software.

    global# ifconfig -a4
    lo0: flags=2001000849 mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    ath0: flags=201000802 mtu 1500 index 2
            inet 0.0.0.0 netmask 0
            ether 0:b:6b:80:bc:59
    bge0: flags=201004843 mtu 1500 index 3
            inet 192.168.15.104 netmask ffffff00 broadcast 192.168.15.255
            ether 0:c0:9f:5b:43:33
    
    vpn# ifconfig -a4
    lo0: flags=2001000849 mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    vpn0: flags=201004843 mtu 1500 index 2
            inet 192.168.15.105 netmask ffffff00 broadcast 192.168.15.255
            ether 2:8:20:86:53:e3
    ip.tun0: flags=10010008d1 mtu 1366 index 3
            inet tunnel src 192.168.15.105 tunnel dst 192.168.101.183
            tunnel security settings  -->  use 'ipsecconf -ln -i ip.tun0'
            tunnel hop limit 60
            inet 192.168.48.27 --> 192.168.76.43 netmask ffffffff
    
    This demonstrates one of the features of Crossbow. I will now be able to do a lot more with zones, while taking advantage of IP Instances, without needing multiple NICs. This is great for customer demos. I have not covered items such as the virtual switch that is created, or the ability to snoop traffic between zones now, or all the resource monitoring and controls that Crossbow offers. More on that elsewhere and in the future.

    P.S. Crossbow affects and works with a lot of the generic LAN driver (GLD) framework, delivers a new MAC interface, and utilizes improvements in dladm and data link naming (vanity naming from Project Clearview), among other things, so it amounts to a lot of code changes. There is a high level of interest in getting the VNIC features into Solaris 10. If you have a strong need for that, please add a Service Record using your support channel to Change Request 6790102.

    Wednesday Mar 26, 2008

    How to BFU a System

    Sometimes you want to try out a new feature not yet delivered into Solaris Nevada, and you have to apply binaries using BFU. I imagine that if you do this all the time, you know all the tricks and gotchas. I don't do it often enough and sometimes get caught up in some details. So here are the steps I tend to use.

    First, get the latest BFU package from the ON (OS/Net) Consolidation. I typically only use the SUNWonbld tar file for my hardware.

    Download the bits you want to install, such as those for Crossbow Beta or Clearview's snoop on loopback.

    To make life a little simpler, I add the following to root's .profile file.

    if [ -d /opt/onbld ]
    then
       FASTFS=/opt/onbld/bin/`uname -p`/fastfs ; export FASTFS
       BFULD=/opt/onbld/bin/`uname -p`/bfuld ; export BFULD
       GZIPBIN=/usr/bin/gzip ; export GZIPBIN
       PATH=$PATH:/opt/onbld/bin
    fi
    

    Now to apply the bits. After unpacking them into a temporary location, let's say /tmp/bfu, install the onbld package.

    # pkgadd -d onbld all
    
    Processing package instance <SUNWonbld> from </tmp/bfu/onbld>
    
    OS-Net Build Tools(sparc) 11.11,REV=2008.03.18.14.39
    Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
    Use is subject to license terms.
    
    ...
    
    Installation of <SUNWonbld> was successful.
    #
    
    I re-read my .profile, and verify that the necessary BFU variables are set.
    # . /.profile
    # echo $FASTFS
    /opt/onbld/bin/sparc/fastfs
    
    Now apply the BFU (this one is for Crossbow beta). You must use the full pathname!

    Note: you may want to do this from the console, in case you lose your network connection.

    # bfu `pwd`/nightly-nd
    Copying /opt/onbld/bin/bfu to /tmp/bfu.1000
    Executing /tmp/bfu.1000 /tmp/bfu/nightly-nd
    
    ...
    
    Entering post-bfu protected environment (shell: ksh).
    Edit configuration files as necessary, then reboot.
    
    bfu#
    
    Note that you end up in the BFU shell. Now issue an automatic conflict resolution check.
    bfu# /opt/onbld/bin/acr
    Getting ACR information from /tmp/bfu/nightly-nd... ok
    
    updating //platform/sun4v/boot_archive
    Finished.  See /tmp/acr.nhaqVi/allresults for complete log.
    bfu#
    
    bfu# exit
    Exiting post-bfu protected environment.  To reenter, type:
    LD_NOAUXFLTR=1 LD_LIBRARY_PATH=/tmp/bfulib LD_LIBRARY_PATH_64=/tmp/bfulib/64 
    PATH=/tmp/bfubin /tmp/bfubin/ksh
    #
    
    It's time to reboot and run with the new bits!

    Monday Mar 10, 2008

    Use Cases for Network Virtualization and Resource Control (Project Crossbow)

    Network Virtualization and Resource Control, more often referred to as Project Crossbow, is in beta starting today. Some may wonder whether they should try the beta code, and if so, how to show the benefits Crossbow delivers. Here is a list of some use cases for Crossbow.

    Network Virtualization

    Requirement: You need more NICs than are installed or supported on the system. You want to use zones with exclusive IP Instances, but share a single NIC or a small number of NICs.

    Feature: Any Crossbow-supported NIC can now be split up into several VNICs, and those VNICs can be assigned to different zones. Optionally, resource management can be applied to any or all VNICs.

    Benefit: Zones that need network administrative isolation can share a single NIC. Traffic between zones with exclusive IP Instances can be contained within the system if the zones use VNICs on the same NIC. Resource management can be used to limit CPU or network bandwidth associated with a zone by applying controls on a VNIC.

    How to Demonstrate (a command sketch follows the list):

    • create zones if they don't exist
    • configure zones as ip-type=exclusive
    • create VNICs
    • assign VNICs to zones
    • boot zones
    • observe distributed traffic
    • optionally apply resource controls and observe
    or
    • create VNICs
    • assign IP addresses to VNICs
    • run services bound to separate IP addresses
    • observe distributed traffic
    • optionally apply resource controls and observe
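
    A minimal command sketch of the zone-based variant above. The zone name web1 and the VNIC name vnic1 are made up, and the zonecfg steps are abbreviated to the networking-related ones.
    global# dladm create-vnic -l bge0 vnic1
    global# zonecfg -z web1
    zonecfg:web1> set ip-type=exclusive
    zonecfg:web1> add net
    zonecfg:web1:net> set physical=vnic1
    zonecfg:web1:net> end
    zonecfg:web1> commit
    zonecfg:web1> exit
    global# zoneadm -z web1 boot
    global# dladm set-linkprop -p maxbw=200M vnic1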

    Network Traffic Observability

    Requirement: Need to measure and monitor network traffic for different services on the system.

    Feature: Bytes and packets received and transmitted can be counted and monitored.

    Benefit: Better understanding of network traffic patterns, and potential data points to make future resource control decisions. Opportunity to do chargeback based on network usage.

    How to Demonstrate (a command sketch follows the list):

    • create one or more VNICs using dladm
    • create one or more flows using flowadm
    • show data in real-time using dladm or flowadm
    • show historical data
    • show for data link/NIC, VNIC, and flow
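
    A sketch of the real-time part. The flow name and attributes are made up, and the flow attribute syntax and the -s statistics options are my reading of the dladm(1M) and flowadm(1M) man pages, so verify them on the build you are running.
    global# dladm create-vnic -l bge0 vnic1
    global# flowadm add-flow -l vnic1 -a transport=tcp,local_port=80 httpflow
    global# dladm show-link -s -i 5 vnic1
    global# flowadm show-flow -s -i 5 httpflow
    As I understand it, historical data requires enabling extended network accounting with acctadm(1M) first, after which dladm show-usage and flowadm show-usage can report from the accounting log.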

    Network Resource Management

    Requirement: Limit the amount of network bandwidth used by a service. Control which CPU(s) are used to process network traffic for a service.

    Feature: Limits on the maximum network traffic in bits/second can be set. Network traffic processing can be directed to one or more CPUs, providing better response time for the network stack, or ensuring that network stack processing will not interfere with other resource consumers on the system.

    Benefit: Finer control of resource utilization. Ability to set quality of service. Prevention of resource starvation by competing consumers. Denial of Service attack defense.

    How to Demonstrate (a command sketch follows the list and the note below):

    • create one or more VNICs using dladm
    • create one or more flows using flowadm
    • set bandwidth caps on VNICs or flows
    • set CPU binding on VNICs or flows
    • see limits enforced under heavy network load by observing the application(s)' data throughput, for example, metrics from
      • wget
      • ftp
      • dladm
      • flowadm statistics
      • your own application metric(s)
    • show different CPU utilization or distribution using mpstat

    Note: bandwidth guarantees are not available at this time.
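
    As a sketch of the cap and binding steps, assuming a VNIC and a flow like the ones above already exist. The maxbw and cpus property names are from my reading of the man pages, and the values are arbitrary.
    global# dladm set-linkprop -p maxbw=300M vnic1
    global# dladm set-linkprop -p cpus=0,1 vnic1
    global# flowadm set-flowprop -p maxbw=50M httpflow
    global# dladm show-linkprop -p maxbw,cpus vnic1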

    Network Performance Improvements

    Requirement: Faster network processing. More efficient network processing.

    Features: Improved datagram processing within the IP stack. Automatic switching between interrupt and polling to speed packet processing and remove interrupt overhead.

    Benefit: Existing network applications will run faster, with lower latency, higher throughput, and more CPU available to other services. No application changes are required.

    How to Demonstrate:

    Compare your application's performance differences

    • using Solaris Nevada build 81 vs. Crossbow beta
    • using Solaris 10 vs. Crossbow beta
    Measure latency or throughput, depending on which is more important to your application, and also observe changes in CPU utilization.

    Improved IP Forwarding

    Requirement: Faster forwarding of IP datagrams.

    Feature: Faster forwarding of IP datagrams, especially as routing/forwarding tables get large.

    Benefit: Solaris is a better platform for routers and firewalls.

    How to Demonstrate:

    Compare your router's performance differences

    • using Solaris Nevada build 81 vs. Crossbow beta
    • using Solaris 10 vs. Crossbow beta
    Measure latency and throughput, and also observe differences in CPU utilization.
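
    To use the system as an IPv4 router for such a comparison, IP forwarding needs to be enabled first. One way to do that is with routeadm(1M); a quick sketch:
    global# routeadm -e ipv4-forwarding
    global# routeadm -u
    global# routeadm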

    Additional Info

    Nicolas' Private Virtual Network

    Sunay's blog on network in a box

    Karol's testing of Crossbow

    Thursday Feb 14, 2008

    Network Virtualization and Resource Control--Crossbow pre-Beta

    The pre-beta bits and updated material for Project Crossbow have been posted to the opensolaris.org web site. If splitting a NIC into several virtual NICs, limiting network bandwidth, allocating CPUs to specific network traffic, faster datagram forwarding, or enhanced visibility into what your network traffic looks like sounds interesting, check it out.

    The code is available as a customized Nevada build 81 image, or you can install the BFU bits on top of an existing build 81 install. It may work with a slightly older or newer build (I did some testing with build 82), but that has not been fully tested.

    Plans are to put the features into Nevada after the beta period and your feedback.

    Thanks to the engineering teams for all the effort in getting this out! Many of my customers have been waiting for this to become available.

    Patches for Using IP Instances with ce NICs are Available

    The [Solaris 10] patches to be able to use IP Instances with the Cassini ethernet interface, known as ce, are available on sunsolve.sun.com for Solaris 10 users with a maintenance contract or subscription. (This is for Solaris 10 8/07, or a prior update patched to that level. These patches are included in Solaris 10 5/08, and also in patch clusters or bundles delivered at or around the same time, and since then.)

    The SPARC patches are:

    • 137042-01 SunOS 5.10: zoneadmd patch
    • 118777-12 SunOS 5.10: Sun GigaSwift Ethernet 1.0 driver patch

    The x86 patches are:

    • 137043-01 SunOS 5.10_x86: zoneadmd patch
    • 118778-11 SunOS 5.10_x86: Sun GigaSwift Ethernet 1.0 driver patch

    I have not been able to try out the released patches myself, yet.

    Steffen

    Thursday Feb 07, 2008

    Solaris 10 Update 5 Beta Program!

    The Solaris 10 Update 5 Beta Program solicitation for external beta candidates has gone out. To apply, go here.

    The new or enhanced features in S10 Update 5 Beta Release to be tested include but are not limited to:

    • Infiniband flash update tool
    • Sockets Direct Protocol (SDP)
    • Persistent Group Reservation for iSCSI target
    • iSNS Client for iSCSI target
    • IP addressing ability for IBTF interfaces
    • Graphical User Interface for PostgreSQL (pgAdmin 3)
    • Support for download of new firmware into SATA drives
    • Support for Enhanced Intel SpeedStep power management technology
    • SunVTS 7.0
    • SAS multipathing support
    • Flash 9
    • New Instant Messaging Client (pidgin 2.0)
    • Virtual Network Computing (VNC)
    • Capping CPU resource usage

    Crossbow beta coming soon

    If you have been interested in the features Crossbow has to offer, the beta will be in March, prior to the putback of the code into Nevada. Keep an eye on the Crossbow Discussion Forum or subscribe to crossbow-discuss@opensolaris.org.

    The beta will include VNICs, flows, improved IP forwarding, and hardware classification. More details to come.
