Monday Nov 15, 2010

Solaris 11 Express 2010.11 is available!!

Congratulations to everyone for getting Solaris 11 Express out! With this release come a lot of networking improvements and features, including the following:
  • Network Virtualization and Resource Management (Crossbow) with VNICs and flows
  • New IP administrative interface (ipadm)
  • IPMP rearchitecture
  • IP observability and data link statistics (dlstat)
  • Link protection when a data link is given to a guest VM
  • Layer 2 bridging
  • More device types supported by dladm, including WiFi
  • Network Automagic for automatic network selection on desktops
  • Improved socket interface for better performance
Get more information here.

More to follow in the future!

Steffen

Thursday Jun 17, 2010

TCP Fusion and improved loopback traffic

In the past, when two processes were communicating using TCP on the same system, a lot of the TCP and IP protocol processing was performed just as it was for traffic to and from another system. A significant amount of CPU is spent in the protocol layers, on both the sending and the receiving side, to ensure data is delivered successfully, completely, in order, without duplication, and re-routed around network failures, even for data that never leaves the system. So there is considerable performance benefit in providing a short circuit for that data.

In Solaris 10 6/06 a feature called TCP Fusion was delivered, which removes all the stack processing when both ends of the TCP connection are on the same system, and now with IP Instances, in the same IP Instance (between the global zone and all shared IP zones, or within an exclusive zone). There are some exceptions, including when using IPsec, IPQoS, raw sockets, kernel SSL, or non-simple TCP/IP conditions, or when the two end points are on different squeues. A fused connection will also revert to unfused if an IP Filter rule would drop a packet. In the general case, however, TCP Fusion is done.

So why do I bring this up? With TCP fusion enabled (which it is by default in Solaris 10 6/06 and later, and in OpenSolaris), when a TCP connection is created between processes on a system, the necessary things are set up to transfer data from the sender to the receiver without sending it down and back up the stack. The typical flow control of filling a send buffer (defaults to 48K or the value of tcp_xmit_hiwat, unless changed via a socket operation) still applies. With TCP Fusion on, there is a second check, which is the number of writes to the socket without a read. The reason for the counter is to allow the receiver to get CPU cycles, since the sender and receiver are on the same system and may be sharing one or more CPUs. The default value of this counter is eight (8), as determined by tcp_fusion_rcv_unread_min. The value per TCP connection is calculated as

MAX(sndbuf >> 14, tcp_fusion_rcv_unread_min);
Some details of the reasoning and implementation are in Change Request 4821256.
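To put numbers on that formula: with the default 48K (49152 byte) send buffer, sndbuf >> 14 is only 3, so the tcp_fusion_rcv_unread_min default of 8 wins; the send buffer has to grow past 128 KB before the per-connection limit rises above 8. A quick check with shell arithmetic (a sketch, using any shell with C-style arithmetic such as bash or ksh93):

# echo $((49152 >> 14))     # default 48K send buffer
3
# echo $((262144 >> 14))    # a 256K send buffer would allow 16 unread writes
16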

When doing large writes, or when the receiver is actively reading, the buffer flow control dominates. However, when doing smaller writes, it is easy for the sender to exceed the allowed number of consecutive writes without a read, at which point the writer blocks, or, if using non-blocking I/O, gets an EAGAIN error.

The latter was a case at a customer of mine. An ISV application was reporting EAGAIN errors on a new installation, something that hadn't been seen before. More importantly, the ISV was also not seeing it elsewhere or in their test environment.

After some investigation using DTrace, including reproduction on a slightly different system configuration, it became clear that the sending application was getting the error after a burst of writes. The application has both local and remote (on other systems) receivers, and the EAGAIN errors were only happening on the local connection.

I also saw that the application was repeatedly doing a pair of writes, one of 12 bytes and the second of 696 bytes. Thus it would be easy to hit the consecutive write counter before the write buffer is ever filled.

To test this I suggested the customer change the tcp_fusion_rcv_unread_min on their running system using mdb(1). I suggested they increase the counter by a factor of four (4), just to be safe.

# echo "tcp_fusion_rcv_unread_min/W 32" | mdb -kw
tcp_fusion_rcv_unread_min:      0x8            =       0x20
Here is how you check what the current value is.
# echo "tcp_fusion_rcv_unread_min/D" | mdb -k
tcp_fusion_rcv_unread_min:
tcp_fusion_rcv_unread_min:      32
After they ran several hours of tests, the EAGAIN error did not return.

Since then I have suggested they set tcp_fusion_rcv_unread_min to 0, to turn the check off completely. This will allow the buffer size and total outstanding write data volume to determine whether the sender is blocked, as it is for remote connections. Since the mdb change is only good until the next reboot, I suggested the customer change the setting in /etc/system.

* Set TCP fusion to allow unlimited outstanding writes up to the TCP send buffer set by default or the application.
* The default value is 8.
set ip:tcp_fusion_rcv_unread_min=0
To turn TCP Fusion off altogether, something I have not tested, the variable do_tcp_fusion can be set from its default of 1 to 0.
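The corresponding /etc/system entry would presumably look like this (again, a sketch I have not run with myself):

* Turn off TCP Fusion completely. The default value is 1 (enabled).
set ip:do_tcp_fusion=0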

I hope this helps someone who might be trying to understand why errors, or maybe lower than expected throughput, are being seen on local connections.

And I would like to note that in OpenSolaris only the do_tcp_fusion setting is available. With the delivery of CR 6826274, the consecutive write counting has been removed. The TCP Fusion code has also been moved into its own file.

Thanks to Jim Eggers, Jim Fiori, Jim Mauro, Anders Parsson, and Neil Putnam for their help as I was tracking all this stuff down!

Steffen

PS. After publishing, I wrote this DTrace script to show what the per connection outstanding write counter tcp_fuse_rcv_unread_hiwater is set to.

# more tcp-fuse.d
#!/usr/sbin/dtrace -qs

fbt:ip:tcp_fuse_maxpsz_set:entry
{
        self->tcp = (tcp_t *) arg0;
}

fbt:ip:tcp_fuse_maxpsz_set:return
/self->tcp > 0/
{
        this->peer = (tcp_t *) self->tcp->tcp_loopback_peer;
        this->hiwat = this->peer->tcp_fuse_rcv_unread_hiwater;

        printf("pid: %d tcp_fuse_rcv_unread_hiwater: %d \n", pid, this->hiwat);

        self->tcp = 0;
        this->peer = 0;
        this->hiwat = 0;
}

Wednesday Feb 24, 2010

My thoughts on configuring zones with shared IP instances and the 'defrouter' parameter

An occasional call or email I receive has questions about routing issues when using Solaris Zones in the (default) shared IP Instance configuration. Everything works well when the non-global zones are on the same IP subnet (let's say 172.16.1.0/24) as the global zone. Routing gets a little tricky when the non-global zones are on a different subnet.

My general recommendation is to isolate. This means:

  • Separate subnets for the global zone (administration, backup) and the non-global zones (applications, data).
  • Separate data-links for the global and non-global zones.
    • The non-global zones can share a data-link
    • Non-global zones on different IP subnets use different data-links
Using separate data-links is not always possible, though, and I was concerned whether sharing a data-link between the global and non-global zones would actually work.

So I did some testing, and exchanged some emails because of a comment I made regarding PSARC/2008/057 and the automatic removal of a default route when the zone is halted.

Turns out I have been very restrictive in suggesting that the global and non-global zones not share a data-link. While I think that is a good administrative policy, to separate administrative and application traffic, it is not a requirement. It is OK to have the global zone and one or more non-global zones share the same data-link. However, if the non-global zones are to have different default routes, they must be on subnets that the global zone is not on.

My test case running Solaris 10 10/09 has the global zone on the 129.154.53.0/24 network and the non-global zone on the 172.16.27.0/24 network.

global# ifconfig -a
...
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 129.154.53.132 netmask ffffff00 broadcast 129.154.53.255
        ether 0:14:4f:ac:57:c4
e1000g0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        zone shared1
        inet 172.16.27.27 netmask ffffff00 broadcast 172.16.27.255

global# zonecfg -z shared1 info net
net:
        address: 172.16.27.27/24
        physical: e1000g0
        defrouter: 172.16.27.16

The routing tables as seen from both zones are:
global# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              129.154.53.215       UG        1        123
default              172.16.27.16         UG        1          7 e1000g0
129.154.53.0         129.154.53.132       U         1         50 e1000g0
224.0.0.0            129.154.53.132       U         1          0 e1000g0
127.0.0.1            127.0.0.1            UH        3         80 lo0

shared1# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              172.16.27.16         UG        1          7 e1000g0
172.16.27.0          172.16.27.27         U         1          3 e1000g0:1
224.0.0.0            172.16.27.27         U         1          0 e1000g0:1
127.0.0.1            127.0.0.1            UH        4         78 lo0:1
While the global zone shows both routes, only the default applying to its subnet will be used. And for traffic leaving the non-global zone, only its default will be used.

You may notice that the Interface column for the global zone's default route is blank. That is because I have set the default route via /etc/defaultrouter. I noticed that if it is determined via the router discovery daemon, it will be listed as being on e1000g0! This does not affect the behavior; however, it may be visually confusing, which is probably why I initially leaned towards saying to not share the data-link.
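For completeness, that file just contains the router's address; a sketch matching the routing table above:

global# cat /etc/defaultrouter
129.154.53.215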

There are multiple ways of determining which route might be used, including ping(1M) and traceroute(1M). I like the output of the route get command.

global# route get 172.16.29.1
   route to: 172.16.29.1
destination: default
       mask: default
    gateway: 129.154.53.1
  interface: e1000g0
      flags: <UP,GATEWAY,DONE,STATIC>
 recvpipe  sendpipe  ssthresh    rtt,ms rttvar,ms  hopcount      mtu     expire
       0         0         0         0         0         0      1500         0

shared1# route get 172.16.28.1
   route to: 172.16.28.1
destination: default
       mask: default
    gateway: 172.16.27.16
  interface: e1000g0:1
      flags: <UP,GATEWAY,DONE,STATIC>
 recvpipe  sendpipe  ssthresh    rtt,ms rttvar,ms  hopcount      mtu     expire
       0         0         0         0         0         0      1500         0
This quickly shows which interfaces and IP addresses are being used. If there are multiple default routes, repeated invocations of this will show a rotation in the selection of the default routes.
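If you want to watch that rotation explicitly, a quick loop over route get does the trick (a sketch; the destination is just an arbitrary address that is not on either local subnet). When two default routes are present, the gateway line should alternate between them:

global# for i in 1 2 3 4 ; do route get 10.5.5.5 | grep gateway ; done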

Thanks to Erik Nordmark and Penny Cotten for their insights on this topic!

Steffen Weiberle

Thursday Aug 20, 2009

Why are packets going out of the "wrong" interface?

I often refer to this blog by James Carlson, so to help others, and me, find it, here is Packets out of the wrong interface. Thanks James for all the help over the years!

Steffen

VLANs and Aggregations

Every once in a while I see the question asking whether it is possible to use IEEE 802.1q VLANs together with IEEE 802.3ad Link Aggregation. I frequently have to check myself. So in order to better remind me, and share with others, here is a quick demonstration of how to get the two working together.

My test system is running build 05 of the upcoming Solaris 10 10/09 (update 8). The system has four bge interfaces, and I will use numbers 1 and 2. (This should work just as well with previous updates of Solaris 10, and with Sun Trunking in Solaris 9, except for the zones parts. I am using zones just to isolate my traffic generation and easily get it to use a specific data link.)

Starting out, things look like this.

global# dladm show-dev
bge0            link: up        speed: 1000  Mbps       duplex: full
bge1            link: unknown   speed: 0     Mbps       duplex: unknown
bge2            link: unknown   speed: 0     Mbps       duplex: unknown
bge3            link: unknown   speed: 0     Mbps       duplex: unknown
global# dladm show-link
bge0            type: non-vlan  mtu: 1500       device: bge0
bge1            type: non-vlan  mtu: 1500       device: bge1
bge2            type: non-vlan  mtu: 1500       device: bge2
bge3            type: non-vlan  mtu: 1500       device: bge3
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
        ether 0:3:ba:e3:42:8b
I have my switch set up to aggregate ports 1 and 2, and here is how I do it with Solaris 10.
global# dladm create-aggr -d bge1 -d bge2 1
global# dladm show-link
bge0            type: non-vlan  mtu: 1500       device: bge0
bge1            type: non-vlan  mtu: 1500       device: bge1
bge2            type: non-vlan  mtu: 1500       device: bge2
bge3            type: non-vlan  mtu: 1500       device: bge3
aggr1           type: non-vlan  mtu: 1500       aggregation: key 1
VLAN-tagged interfaces are used by accessing the underlying data link through a name that combines the VLAN ID and the device instance. For bge1 and VLAN 111 that would be bge111001; for aggr1 it would be aggr111001.
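The naming rule is VLAN ID times 1000 plus the device instance (or aggregation key), appended to the driver or aggregation name. A quick sanity check of the arithmetic for bge1 on VLAN 111:

global# expr 111 \* 1000 + 1
111001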

For this setup I am using zones zone111 and zone112, each configured with an exclusive IP Instance. The zone configuration looks like this.

global# zonecfg -z zone111 info
zonename: zone111
zonepath: /zones/zone111
brand: native
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
inherit-pkg-dir:
        dir: /lib
inherit-pkg-dir:
        dir: /platform
inherit-pkg-dir:
        dir: /sbin
inherit-pkg-dir:
        dir: /usr
net:
        address not specified
        physical: aggr111001
        defrouter not specified
Once configured, installed, and booted, the network configuration of zone111 is:
global# zlogin zone111 ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
aggr111001: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
        inet 172.16.111.141 netmask ffffff00 broadcast 172.16.111.255
        ether 0:3:ba:e3:42:8c
Turns out that configuring this was easy compared to showing that the link aggregation was really working. While the full list of links known to the global zone includes the aggregation and the VLANs on the aggregation, tools such as netstat or nicstat do not include them. As it turns out, they only report on interfaces that are plumbed up in that IP Instance. It is not possible to plumb either bge1 or bge2, since they are members of the aggregation.
global# dladm show-link
bge0            type: non-vlan  mtu: 1500       device: bge0
bge1            type: non-vlan  mtu: 1500       device: bge1
bge2            type: non-vlan  mtu: 1500       device: bge2
bge3            type: non-vlan  mtu: 1500       device: bge3
aggr1           type: non-vlan  mtu: 1500       aggregation: key 1
aggr111001      type: vlan 111  mtu: 1500       aggregation: key 1
aggr112001      type: vlan 112  mtu: 1500       aggregation: key 1
global# netstat -i
Name  Mtu  Net/Dest      Address        Ipkts  Ierrs Opkts  Oerrs Collis Queue
lo0   8232 loopback      localhost      98     0     98     0     0      0
bge0  1500 pinebarren    pinebarren     43101  0     7181   0     0      0
So I ended up using kstat(1M) to get the counts of outbound packets. I am interested in outbound because that is the direction in which Solaris determines how traffic is distributed across the links in an aggregation; the switch determines that for inbound traffic.

This example shows data on instance 2 of the bge interface for kstat value opackets.

global# kstat -m bge -i 2 -s opackets
module: bge                             instance: 2
name:   mac                             class:    net
        opackets                        2542
With kstat I can see that for different connections either bge1 or bge2 has packets going out on it. A good test for me was scp to a remote system. Neither ping nor traceroute caused the necessary hashing to use both links in the aggregation.
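To watch the distribution while a transfer is running, kstat can also repeat at an interval. For example, this (a sketch) prints opackets for all bge instances every five seconds:

global# kstat -m bge -s opackets 5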

Steffen

Monday Jun 01, 2009

OpenSolaris 2009.06 Delivers Crossbow (Network Virtualization and Resource Control)

Today OpenSolaris 2009.06, the third release of OpenSolaris, is announced and available for download. Among the many features in this version is the delivery of Project Crossbow, in a fully supported distribution. This brings network virtualization, including Virtual NICs (VNICs), bandwidth control and management, flow (QoS) creation and management, virtual switches, and other features to OpenSolaris.

Network virtualization joins a number of other features already in OpenSolaris, such as vanity naming (allowing custom names for data links), snooping on loopback for better observability, a re-architected IPMP with an administrative interface, and Network Automagic (NWAM--automatic configuration of desktop networking based on available wired and wireless network services).

Congratulations to everyone who made all this possible!

Steffen

PS: Regarding "fully supported", please note the new support prices and durations!

Thursday Apr 16, 2009

What happened to my packets? -- or -- Dual default routes and shared IP zones

I recently received a call from someone who has helped me out a lot on some performance issues (thanks, Jim Fiori), and I was glad to be able to return even a small part of those favors!

He had been contacted to help a customer who was ready to deploy a web application, and they were experiencing intermittent failures connecting to the web site. Interestingly, they were also using zones, a bunch of them (OK, a handful)--and so right up my alley.

The customer was running a multi-tiered web application on an x4600 (so Solaris on x86 as well!), with the web server, web router, and application tiers in different zones. They were using shared IP Instances, so all the network configuration was being done in the global zone.

Initially, we had to modify some configuration parameters, especially regarding default routes. Since the system was installed with Solaris 10 5/08 and had more recent patches, we could use the defrouter feature introduced in 10/08 to make setting up routes for the non-global zones a little easier. This was needed because the global zone was using only one NIC, and it was not going to be on the networks that the non-global zones were on.

What made the configuration a little unique was that the web server needs a default router to the Internet, while the application server needs a route to other systems behind a different router. Individually, everything is fine. However, the web1 zone also needs to be on the network that the application and web router are on, so it ends up having two interfaces.

Let's look at web1 when it is the only zone running.

web1# ifconfig -a4
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 172.16.1.41 netmask ffffff00 broadcast 172.16.1.255
bge2:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        inet 192.168.51.41 netmask ffffff00 broadcast 192.168.51.255
web1# netstat -rn
Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              172.16.1.1           UG        1          0 bge1
172.16.1.0           172.16.1.41          U         1          0 bge1:1
192.168.51.0         192.168.51.41        U         1          0 bge2:1
224.0.0.0            172.16.1.41          U         1          0 bge1:1
127.0.0.1            127.0.0.1            UH        5         34 lo0:1

The zone is on two interfaces, bge1 and bge2, and has a default route that uses bge1. However, when zone app1 is running, there is a second default route, on bge2. The same is true if app2 or odr are running. Note that these three zones are only on bge2.

app1# ifconfig -a4
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
bge2:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        inet 192.168.51.43 netmask ffffff00 broadcast 192.168.51.255
app1# netstat -rn
Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              192.168.51.1         UG        1          0 bge2
192.168.51.0         192.168.51.43        U         1          0 bge2:1
224.0.0.0            192.168.51.43        U         1          0 bge2:1
127.0.0.1            127.0.0.1            UH        3         51 lo0:1

In the meantime, this is what happens in web1.

web1# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- --------- 
default              192.168.51.1         UG        1          0 bge2
default              172.16.1.1           UG        1          0 bge1 
172.16.1.0           172.16.1.41          U         1          0 bge1:1
192.168.51.0         192.168.51.41        U         1          0 bge2:4
224.0.0.0            172.16.1.41          U         1          0 bge1:1
127.0.0.1            127.0.0.1            UH        6        132 lo0:4

With any of the other zones running, web1 now has two default routes. And it only happens in web1, as it is the only zone with its public facing data link bge1 and a shared data link (bge2).

Traffic to any system on either the 192.168.51.0 or 172.16.1.0 network will have no issues. Every time IP needs to determine a new path for a system not on either of those two networks, it will pick a route, and it will round-robin between the two default routes. Thus approximately half the time, connections will fail to establish, or possibly existing connections will stop working if they have been idle for a while.

This is how IP is supposed to work, so there is technically nothing wrong. It is a feature of zones and a shared IP Instance. [2009.06.23: For background on why IP works this way, see James' blog].

The only problem is that this is not what the customer wants!

One option would be to force all traffic between the web and application tiers out the bge1 interface, putting it on the wire. This may not be desirable for security reasons, and it introduces latency since traffic now goes on the wire. Another option would be to use exclusive IP Instances for the web servers. For each web zone, and this example only has one, that would require two additional data links (NICs). That would add up. Also, this configuration is targeted to be used with Solaris Cluster's scalable services, and those must be in shared IP Instance zones. Hummm....as I like to say.

We didn't know about the shared IP Instance restriction of Solaris Cluster, and as the customer was considering how they were going to add additional NICs to all the systems, something slowly developed in my mind. How about creating a shared, dummy network between the web and application tier? They had one spare NIC, and with shared IP it does not even need to be connected to a switch port, since IP will loop all traffic back anyway!

The more I thought about it, the more I liked it, and I could not see anything wrong with it. At least not technically as I understood Solaris. Operationally, for the customer, it might be a little awkward.

Here is what I was thinking of...

With this configuration the web1 zone has a default router only to the Internet and it can reach odr, and if necessary, app1 and app2, directly via the new network. And app1 and app2 only have a single default route to get to the Intranet. The nice thing is that bge3 does not even need to be up. That is visible with ifconfig output, where bge3 is not showing a RUNNING flag, which indicates the port is not connected (or in my case has been disabled on the switch).

global# ifconfig -a4
...
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
        ether 0:3:ba:e3:42:8b
bge1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 0.0.0.0 netmask 0
        ether 0:3:ba:e3:42:8c
bge2: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        inet 0.0.0.0 netmask 0
        ether 0:3:ba:e3:42:8d 
bge3: flags=1000802<BROADCAST,MULTICAST,IPv4> mtu 1500 index 5 
        inet 0.0.0.0 netmask 0
        ether 0:3:ba:e3:42:8e
...
And within web1 there is now only one default route.
web1# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- --------- 
default              172.16.1.1           UG        1         17 bge1 
172.16.1.0           172.16.1.41          U         1          2 bge1:1
192.168.52.0         192.168.52.41        U         1          2 bge3:1
224.0.0.0            172.16.1.41          U         1          0 bge1:1
127.0.0.1            127.0.0.1            UH        4        120 lo0:1
In the customer's case, multiple systems were being used, so the private networks were connected together so that a web zone on one system could access an odr zone on another. I am showing the simple, single system case since it is so convenient.
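For reference, the web1 net resources behind this would look roughly like the following (a sketch reconstructed from the routing table above; only the net resources are shown):

global# zonecfg -z web1 info net
net:
        address: 172.16.1.41/24
        physical: bge1
        defrouter: 172.16.1.1
net:
        address: 192.168.52.41/24
        physical: bge3
        defrouter not specified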

If I were using Solaris Express Community Edition (SX-CE) or OpenSolaris 2009.06 Developer Builds, with the Crossbow bits and virtual NICs (VNICs) available, I wouldn't even have needed to use that physical interface. Both are available here.

I hope this trick might help others out in the future.

Steffen

Tuesday Apr 14, 2009

Using IPMP with link based failure detection

Solaris has had a feature to increase network availability called IP Multipathing (IPMP). Initially it required a test address on every data link in an IPMP group, where the test addresses were used as the source IP address to probe network elements for path availability. One of the benefits of probe-based failure detection is that it can extend beyond the directly connected link(s), and verify paths through the attached switch(es) to what typically is a router or other redundant element to provide available services.

Having one IP address (whether public or private, non-routable) per data link, plus the separate address(es) for the application(s), turns out to be a lot of addresses to allocate and administer. And since the default of five probes spaced two seconds apart means a failure takes at least ten (10) seconds to detect, something more was needed.

So in the Solaris 9 timeframe the ability to also do link based failure detection was delivered. It requires specific NICs whose driver has the ability to notify the system that a link has failed. The Introduction to IPMP in the Solaris 10 Systems Administrators Guide on IP Services lists the NICs that support link state notification. Solaris 10 supports configuring IPMP with only link based failure detection.

global# more /etc/hostname.bge[12]
::::::::::::::
/etc/hostname.bge1
::::::::::::::
10.1.14.140/26 group ipmp1 up
::::::::::::::
/etc/hostname.bge2
::::::::::::::
group ipmp1 standby up
On system boot, there will be an indication on the console that since no test addresses are defined, probe-based failure detection is disabled.

Apr 10 10:57:20 in.mpathd[168]: No test address configured on interface bge2; disabling probe-based failure detection on it
Apr 10 10:57:20 in.mpathd[168]: No test address configured on interface bge1; disabling probe-based failure detection on it
Looking at the interfaces configured,
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
        ether 0:3:ba:e3:42:8b
bge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 10.1.14.140 netmask ffffffc0 broadcast 10.1.14.191
        groupname ipmp1
        ether 0:3:ba:e3:42:8c
bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
bge2: flags=69000842<BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 0 index 4
        inet 0.0.0.0 netmask 0
        groupname ipmp1
        ether 0:3:ba:e3:42:8d
you will notice that two of the three interfaces in the IPMP group have no address (0.0.0.0). Also, the data address is on the physical interface bge1, while bge2 has the 0.0.0.0 address. On the failure of bge1,
Apr 10 14:34:53 global bge: NOTICE: bge1: link down
Apr 10 14:34:53 global in.mpathd[168]: The link has gone down on bge1
Apr 10 14:34:53 global in.mpathd[168]: NIC failure detected on bge1 of group ipmp1
Apr 10 14:34:53 global in.mpathd[168]: Successfully failed over from NIC bge1 to NIC bge2


global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
        ether 0:3:ba:e3:42:8b
bge1: flags=19000802<BROADCAST,MULTICAST,IPv4,NOFAILOVER,FAILED> mtu 0 index 3
        inet 0.0.0.0 netmask 0
        groupname ipmp1
        ether 0:3:ba:e3:42:8c
bge2: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
        groupname ipmp1
        ether 0:3:ba:e3:42:8d
bge2:1: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
        inet 10.1.14.140 netmask ffffffc0 broadcast 10.1.14.191
the data address is migrated onto bge2:1. I find this a little confusing. However, I don't know any way around it on Solaris 10. The IPMP Re-architecture makes this a lot easier!
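By the way, you do not have to pull a cable to exercise a failover; if_mpadm(1M) can offline and reattach an interface in the group. A sketch:

global# if_mpadm -d bge1    # offline bge1; the data address moves to bge2
global# if_mpadm -r bge1    # reattach bge1 when done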

Using Link-based IPMP with non-global zones

Configuring a shared IP Instance non-global zone and utilizing IPMP managed in the global zone is very easy.

The IPMP configuration is very simple. Interface bge1 is active, and bge2 is in stand-by mode.

global# more /etc/hostname.bge[12]
::::::::::::::
/etc/hostname.bge1
::::::::::::::
group ipmp1 up
::::::::::::::
/etc/hostname.bge2
::::::::::::::
group ipmp1 standby up
My zone configuration is:
global# zonecfg -z zone1 info
zonename: zone1
zonepath: /zones/zone1
brand: native
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: shared
inherit-pkg-dir:
        dir: /lib
inherit-pkg-dir:
        dir: /platform
inherit-pkg-dir:
        dir: /sbin
inherit-pkg-dir:
        dir: /usr
net:
        address: 10.1.14.141/26
        physical: bge1
Prior to booting, the network configuration is:
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        zone zone1
        inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
        ether 0:3:ba:e3:42:8b
bge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
        groupname ipmp1
        ether 0:3:ba:e3:42:8c
bge2: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
        groupname ipmp1
        ether 0:3:ba:e3:42:8d
After booting, the network looks like this:
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        zone zone1
        inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
        ether 0:3:ba:e3:42:8b
bge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
        groupname ipmp1
        ether 0:3:ba:e3:42:8c
bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        zone zone1
        inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
bge2: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
        groupname ipmp1
        ether 0:3:ba:e3:42:8d

So a simple case for the use of IPMP, without the need for test addresses! Other IPMP configurations, such as more than two data links, or active-active, are also supported with link based failure detection. The more links involved, the more test addresses are saved with link based failure detection. Since writing this entry I was involved in a customer configuration where this is saving several hundred IP addresses and their management (such as avoiding duplicate addresses). That customer is willing to forgo the benefit of probes testing past the local switch port.

Steffen

Tuesday Jan 13, 2009

Using zonecfg defrouter with shared-IP zones

[Update to IPMP testing 2009.01.20]

[Minor update 2009.01.14]

When running Solaris Zones in a shared-IP configuration, all network configurations are determined by how the zone is configured using zonecfg(1M) or by what the global zone's IP determines things should be (such as routes). This has caused some trouble in situations where zones are on different subnets, and especially if the global zone is not on the subnet(s) the non-global zones are on. While exclusive IP Instances were delivered to help address these cases, using exclusive IP Instances requires a data link per zone, and if running a large number of zones there may not be enough data links available.

With Solaris 10 10/08 (Update 6), an additional network configuration parameter is available for shared-IP zones. This is the default router (defrouter) optional parameter.

Using the defrouter parameter, it is possible to set which router to use for traffic leaving the zone. In the global zone, default router entries are created the first time the zone is booted. Note that the entries are not deleted when the zone is halted.
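If you are adding the property to an existing zone's net resource, the zonecfg steps look something like this (a sketch; the addresses match the example below):

global# zonecfg -z shared1
zonecfg:shared1> select net address=10.1.14.141/26
zonecfg:shared1:net> set defrouter=10.1.14.129
zonecfg:shared1:net> end
zonecfg:shared1> commit
zonecfg:shared1> exit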

The defrouter property looks like this for a zone with it configured.

global# zonecfg -z shared1 info net
net:
        address: 10.1.14.141/26
        physical: bge1
        defrouter: 10.1.14.129
And it looks like this if it is not set.
global# zonecfg -z shared1 info net
net:
        address: 10.1.14.141/26
        physical: bge1
        defrouter not specified
So I have run a variety of configurations, and some things I observed are as follows. (Most of the configurations used a separate interface for the global zone (bge0) from the one(s) for the non-global zones (bge1 and bge2). IPMP is not used in these configurations; a comment on that is at the end.) The [#] markers point to examples in the outputs that follow.
  • A default route entry is created for the NIC [1] on which the zone is configured when the zone is booted. [2]
  • Entries are not deleted when a zone is halted. They persist until manually removed [3] or the global zone is rebooted.
  • It is possible to have the same default router configured for multiple zones. [4]
  • It is possible to have the same default router listed on multiple interfaces. * [5]
  • It is possible to have multiple default routers on the same interface, even on different IP subnets. [6]
  • The interface used for outbound traffic is the one the zone is assigned to. [7]
  • It is sufficient to plumb the interface for the non-global zones in the global zone (thus it has 0.0.0.0 as its IP address in the global zone). [8]
  • The physical interface can be down in the global zone. [9]
  • If only one interface is used, and different subnets for the global and non-global zones are configured, routing works when setting defrouter [10] and does not work if it is not set.
The most interesting thing I noticed was that although two non-global zones may be on the same IP subnet, if they are configured on different interfaces, the traffic leaves the system on the interface that the zone is configured to be on. This is typically not the case when using shared IP and the global zone also has an IP address on that subnet.

* Note: Having two interfaces on the same IP subnet without configuring IP Multipathing (IPMP) may not be a supported configuration. I am looking for documentation that states this one way or another. [2009.01.14]

Examples

1. Single Zone, Single Interface--The Basics

Create a single non-global zone.
global# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              139.164.63.215       UG        1          2 bge0
139.164.63.0         139.164.63.125       U         1          1 bge0
224.0.0.0            139.164.63.125       U         1          0 bge0
127.0.0.1            127.0.0.1            UH        1         42 lo0

global# zonecfg -z shared1 info net
net:
        address: 10.1.14.141/26
        physical: bge1
        defrouter: 10.1.14.129

global# zoneadm -z shared1 boot [2]

global# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              139.164.63.215       UG        1          2 bge0
default              10.1.14.129          UG        1          0 bge1 [1]
139.164.63.0         139.164.63.125       U         1          1 bge0
224.0.0.0            139.164.63.125       U         1          0 bge0
127.0.0.1            127.0.0.1            UH        1         42 lo0

global# zoneadm -z shared1 halt

global# zoneadm list -v
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared

global# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              10.1.14.129          UG        1          0 bge1
default              139.164.63.215       UG        1          1 bge0
139.164.63.0         139.164.63.125       U         1          1 bge0
224.0.0.0            139.164.63.125       U         1          0 bge0
127.0.0.1            127.0.0.1            UH        1         42 lo0

global# route delete default 10.1.14.129 [3]
delete net default: gateway 10.1.14.129

global# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              139.164.63.215       UG        1          1 bge0
139.164.63.0         139.164.63.125       U         1          1 bge0
224.0.0.0            139.164.63.125       U         1          0 bge0
127.0.0.1            127.0.0.1            UH        1         42 lo0

2. Multiple Interfaces, Same Default Router

Three zones, where two use bge1 and the third uses bge2. All use the same default router.
global# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              139.164.63.215       UG        1          1 bge0
139.164.63.0         139.164.63.125       U         1          1 bge0
224.0.0.0            139.164.63.125       U         1          0 bge0
127.0.0.1            127.0.0.1            UH        1         42 lo0

global# zonecfg -z shared1 info net
net:
        address: 10.1.14.141/26
        physical: bge1
        defrouter: 10.1.14.129 [4]

global# zonecfg -z shared2 info net
net:
        address: 10.1.14.142/26
        physical: bge1
        defrouter: 10.1.14.129 [4]

global# zonecfg -z shared3 info net
net:
        address: 10.1.14.143/26
        physical: bge2
        defrouter: 10.1.14.129 [5]

global# zoneadm -z shared1 boot

global# zoneadm -z shared2 boot

global# zoneadm -z shared3 boot

global# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              10.1.14.129          UG        1          0 bge1 [4]
default              139.164.63.215       UG        1          1 bge0
default              10.1.14.129          UG        1          2 bge2 [5]
139.164.63.0         139.164.63.125       U         1          1 bge0
224.0.0.0            139.164.63.125       U         1          0 bge0
127.0.0.1            127.0.0.1            UH        1         42 lo0

global# zoneadm list -v
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   3 shared1          running    /zones/shared1                 native   shared
   4 shared2          running    /zones/shared2                 native   shared
   5 shared3          running    /zones/shared3                 native   shared

global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        zone shared1
        inet 127.0.0.1 netmask ff000000
lo0:2: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        zone shared2
        inet 127.0.0.1 netmask ff000000
lo0:3: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        zone shared3
        inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
        ether 0:3:ba:e3:42:8b
bge1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 0.0.0.0 netmask 0
        ether 0:3:ba:e3:42:8c
bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        zone shared1
        inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
bge1:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        zone shared2
        inet 10.1.14.142 netmask ffffffc0 broadcast 10.1.14.191
bge2: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        inet 0.0.0.0 netmask 0
        ether 0:3:ba:e3:42:8d
bge2:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        zone shared3
        inet 10.1.14.143 netmask ffffffc0 broadcast 10.1.14.191

3. Multiple Subnets

Add another zone, using bge2 and on a different subnet.
global# zonecfg -z shared4 info net
net:
        address: 192.168.16.144/24
        physical: bge2
        defrouter: 192.168.16.129

global# zoneadm -z shared4 boot

global# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              10.1.14.129          UG        1          0 bge1
default              10.1.14.129          UG        1          4 bge2
default              139.164.63.215       UG        1          3 bge0
default              192.168.16.129       UG        1          0 bge2 [6]
139.164.63.0         139.164.63.125       U         1          4 bge0
224.0.0.0            139.164.63.125       U         1          0 bge0
127.0.0.1

4. Interface Usage

Issue some pings within the non-global zones and see which network interfaces are used. From the global zone, I issue a ping in each zone to a remote system on the same network as the global zone (139.164.63.0), and see which interfaces are being used. [7]
global# zlogin shared1 ping 139.164.63.38
139.164.63.38 is alive

global# zlogin shared2 ping 139.164.63.38
139.164.63.38 is alive

global# zlogin shared3 ping 139.164.63.38
139.164.63.38 is alive

global# zlogin shared4 ping 139.164.63.38
139.164.63.38 is alive
This shows the pings originating from shared1 and shared2 going out on bge1.
global1# snoop -d bge1 icmp
Using device /dev/bge1 (promiscuous mode)
 10.1.14.141 -> 139.164.63.38 ICMP Echo request (ID: 4677 Sequence number: 0)
139.164.63.38 -> 10.1.14.141  ICMP Echo reply (ID: 4677 Sequence number: 0)
 10.1.14.142 -> 139.164.63.38 ICMP Echo request (ID: 4681 Sequence number: 0)
139.164.63.38 -> 10.1.14.142  ICMP Echo reply (ID: 4681 Sequence number: 0)
And this shows the pings originating from shared3 and shared4 going out on bge2.
global2# snoop -d bge2 icmp
Using device /dev/bge2 (promiscuous mode)
 10.1.14.143 -> 139.164.63.38 ICMP Echo request (ID: 4685 Sequence number: 0)
139.164.63.38 -> 10.1.14.143  ICMP Echo reply (ID: 4685 Sequence number: 0)
192.168.16.144 -> 139.164.63.38 ICMP Echo request (ID: 4689 Sequence number: 0)
139.164.63.38 -> 192.168.16.144 ICMP Echo reply (ID: 4689 Sequence number: 0)
Just to confirm where each zone is configured, here is the ifconfig output.
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        zone shared1
        inet 127.0.0.1 netmask ff000000
lo0:2: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        zone shared2
        inet 127.0.0.1 netmask ff000000
lo0:3: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        zone shared3
        inet 127.0.0.1 netmask ff000000
lo0:4: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        zone shared4
        inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
        ether 0:3:ba:e3:42:8b
bge1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3 [9]
        inet 0.0.0.0 netmask 0 [8]
        ether 0:3:ba:e3:42:8c
bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        zone shared1
        inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
bge1:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        zone shared2
        inet 10.1.14.142 netmask ffffffc0 broadcast 10.1.14.191
bge2: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        inet 0.0.0.0 netmask 0 [8]
        ether 0:3:ba:e3:42:8d
bge2:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        zone shared3
        inet 10.1.14.143 netmask ffffffc0 broadcast 10.1.14.191
bge2:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        zone shared4
        inet 192.168.16.144 netmask ffffff00 broadcast 192.168.16.255

5. Using a Single Interface

Only using bge0 and using different subnets for the global and non-global zones. [10]

Before booting the zone.

global# netstat -nr

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              139.164.63.215       UG        1          2 bge0
139.164.63.0         139.164.63.125       U         1          2 bge0
224.0.0.0            139.164.63.125       U         1          0 bge0
127.0.0.1            127.0.0.1            UH        1         42 lo0

global# zonecfg -z shared17 info net
net:
        address: 192.168.17.147/24
        physical: bge0
        defrouter: 192.168.17.16

global# zoneadm -z shared17 boot
Once the zone is booted, netstat shows both default routes, and a ping from the zone works.
global# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              139.164.63.215       UG        1          2 bge0
default              192.168.17.16        UG        1          0 bge0
139.164.63.0         139.164.63.125       U         1          2 bge0
224.0.0.0            139.164.63.125       U         1          0 bge0
127.0.0.1            127.0.0.1            UH        1         42 lo0

global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        zone shared17
        inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
        ether 0:3:ba:e3:42:8b
bge0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        zone shared17
        inet 192.168.17.147 netmask ffffff00 broadcast 192.168.17.255

global# zlogin shared17 ping 139.164.63.38
139.164.63.38 is alive

IP Multipathing (IPMP)

I did some testing with IPMP and examples similar to the ones above. At this time the combination of IPMP and the defrouter configuration does not work. I have filed bug 6792116 to have this looked at.

[Updated 2009.01.20] After some additional testing, especially with test addresses and probe based failure detection, I have seen IPMP work well only when zones are configured such that at least one zone is on each NIC in an IPMP group, including a standby NIC. For example, if you have two NICs, bge1 and bge2, at least one zone must be configured on bge1 and at least one on bge2. This is even the case when one of the NICs is in failed mode when the system or zone(s) boot. It turns out that the default route is added when the zone boots, and there is no later check for default route requirements as a zone is moved from one NIC to another based on IPMP failover or failback. Thus, I would recommend not using defrouter and IPMP together until the combination is confirmed to work.

If this is important for your deployments, please add a service record to change request 6792116 and work with your service provider to have this addressed. Please also note that this works well with the IPMP Re-architecture coming soon to OpenSolaris.

Wednesday Mar 26, 2008

How to BFU a System

Sometimes you want to try out a new feature not yet delivered into Solaris Nevada, and you have to apply binaries using BFU. I imagine if you do this all the time, you know all the tricks and gotchas. I don't do it often enough and sometimes get caught up in some details. So here are the steps I tend to use.

First, get the latest BFU package from the ON (OS/Net) Consolidation. I typically only use the SUNWonbld tar file for my hardware.

Download the bits you want to install, such as those for Crossbow Beta or Clearview's snoop on loopback.

To make life a little simpler, I add the following to root's .profile file.

if [ -d /opt/onbld ]
then
   FASTFS=/opt/onbld/bin/`uname -p`/fastfs ; export FASTFS
   BFULD=/opt/onbld/bin/`uname -p`/bfuld ; export BFULD
   GZIPBIN=/usr/bin/gzip ; export GZIPBIN
   PATH=$PATH:/opt/onbld/bin
fi

Now to apply the bits. After unpacking the bits into a temporary location, let's say /tmp/bfu, install the onbld package.
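The unpacking itself is just extracting the archives you downloaded; something like this (a sketch, and the archive names are placeholders for whatever you actually downloaded):

# mkdir /tmp/bfu && cd /tmp/bfu
# bzcat /path/to/SUNWonbld-archive.tar.bz2 | tar xf -
# bzcat /path/to/nightly-nd-archive.tar.bz2 | tar xf -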

# pkgadd -d onbld all

Processing package instance  from 

OS-Net Build Tools(sparc) 11.11,REV=2008.03.18.14.39
Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.

...

Installation of  was successful.
#
I re-read my .profile, and verify that the necessary BFU variables are set.
# . /.profile
# echo $FASTFS
/opt/onbld/bin/sparc/fastfs
Now apply the BFU (this one is for Crossbow beta). You must use the full pathname!

Note: you may want to do this from the console, in case you lose your network connection.

# bfu `pwd`/nightly-nd
Copying /opt/onbld/bin/bfu to /tmp/bfu.1000
Executing /tmp/bfu.1000 /tmp/bfu/nightly-nd

...

Entering post-bfu protected environment (shell: ksh).
Edit configuration files as necessary, then reboot.

bfu#
Note that you end up in the BFU shell. Now issue an automatic conflict resolution check.
bfu# /opt/onbld/bin/acr
Getting ACR information from /tmp/bfu/nightly-nd... ok

updating //platform/sun4v/boot_archive
Finished.  See /tmp/acr.nhaqVi/allresults for complete log.
bfu#

bfu# exit
Exiting post-bfu protected environment.  To reenter, type:
LD_NOAUXFLTR=1 LD_LIBRARY_PATH=/tmp/bfulib LD_LIBRARY_PATH_64=/tmp/bfulib/64 
PATH=/tmp/bfubin /tmp/bfubin/ksh
#
It's time to reboot and run with the new bits!

Thursday Feb 14, 2008

Patches for Using IP Instances with ce NICs are Available

The [Solaris 10] patches to be able to use IP Instances with the Cassini ethernet interface, known as ce, are available on sunsolve.sun.com for Solaris 10 users with a maintenance contract or subscription. (This is for Solaris 10 8/07, or a prior update patched to that level. These patches are included in Solaris 10 5/08, and also in patch clusters or bundles delivered at or around the same time, and since then.)

The SPARC patches are:

  • 137042-01 SunOS 5.10: zoneadmd patch
  • 118777-12 SunOS 5.10: Sun GigaSwift Ethernet 1.0 driver patch

The x86 patches are:

  • 137043-01 SunOS 5.10_x86: zoneadmd patch
  • 118778-11 SunOS 5.10_x86: Sun GigaSwift Ethernet 1.0 driver patch

I have not been able to try out the released patches myself, yet.

Steffen

Monday Nov 05, 2007

Using IP Instances with VLANs or How to Make a Few NICs Look Like Many

[Minor editorial and clarification updates 2009.09.28]

Solaris 10 8/07 includes a new feature for zone networking. IP Instances is the facility to give a non-global zone its own complete control over the IP stack, which previously was shared with and controlled by the global zone.

A zone that has an exclusive IP Instance can set interface parameters using ifconfig(1M), put an interface into promiscuous mode to run snoop(1M), be a DHCP client or server, set ndd(1M) variables, have its own IPsec policies, etc.

One requirement for an exclusive IP Instance is that it must have exclusive access to a link name. This is any NIC, VLAN-tagged NIC component, or aggregation at this time. When they become available, virtual NICs will make this much simpler, as a single NIC can be presented to the zones using a number of VNICs, effectively multiplexing access to that NIC. A link name is an entry that can be found in /dev, such as /dev/bge0, /dev/bge321001 (VLAN tag 321 on bge1), aggr2, and so on.

To see what link names are available on a system, use dladm(1M) with the show-link option. For example:

global# dladm show-link
bge0            type: non-vlan  mtu: 1500       device: bge0
bge1            type: non-vlan  mtu: 1500       device: bge1
bge2            type: non-vlan  mtu: 1500       device: bge2
bge3            type: non-vlan  mtu: 1500       device: bge3

As folks have started to use IP Instances to isolate their zones, they have noticed that they don't have sufficient link names (I'll just use "link" in the rest of this) to assign to the zones they have or wish to configure as exclusive. So, how does a global zone administrator configure a large number of zones as exclusive?

Let's consider the following situation, where there are three tiers of a web service, each tier on a different network.

If each server has only one NIC, the total number of switch ports required is at least eight (8). If each server has a management port, that is another eight ports, even if they are on a different, management network. Add to that at least three switch ports going to the router.

Consolidating the servers onto a single Solaris 10 instance using exclusive IP Instances requires at least eight NICs for the services (one per service), and at least one for the global zone and management. (We'll ignore any service processor requirements, since they are separate anyway, and access could be either via a serial interface or a network.)

One option to consider is using VLANs and VLAN tagging. When using VLAN tagging, additional information is put onto the Ethernet frame by the sender, which allows the receiver to associate that frame with a specific VLAN. The specification allows up to 4094 VLAN tags, from 1 to 4094. For more information on administering VLANs in Solaris 10, see Administering Virtual Local Area Networks in the Solaris 10 System Administrator Collection.

VLANs are a method to collapse multiple Ethernet broadcast domains (whether hubs or switches) into a single network unit (usually a switch). Typically, a single IP subnet, such as 192.168.54.0/24, is on one broadcast domain. Within such a switch frame, you can have a large number of virtual switches, consolidating network infrastructure while still isolating broadcast domains. Often, the use of VLANs is completely hidden from the systems tied to the switch, as a port on the switch is configured for only one VLAN. With VLAN tagging, a single port can allow a system to connect to multiple VLANs, and therefore multiple networks. Both the switch and the system must be configured for VLAN tagging for this to work properly. VLAN tagging has been used for years, and is robust and reliable.

Any one network interface can have multiple VLANs configured for it, but a single VLAN ID can only exist once on each interface. Thus it is possible to put multiple networks or broadcast domains on a single interface. It is not possible to put more than one VLAN of any broadcast domain on a single interface. For example, you can put VLANs 111, 112, and 113 on interface bge1, but you can not put VLAN 111 on bge1 more than once. You can, however, put VLAN 111 on interfaces bge1 and bge2.

Using the case shown above, if the three web servers are on the same network, say 10.1.111.0/24, you would want to have three interfaces that are all connected to a VLAN capable switch, and configure each interface with a VLAN tag that is the same as the VLAN ID on the switch.

For example, if the VLAN tag is 111 and the interfaces are bge1 through bge3, the link names you would assign to the three web servers would be bge111001, bge111002, and bge111003.

Introducing zones into the setup, the web servers can be run in three separate zones, and with exclusive IP Instances, they can be totally separate and each assigned a VLAN-tagged interface. Web Server 1 could have bge111001, Web Server 2 could have bge111002, and Web Server 3 could have bge111003.

global# zonecfg -z web1 info net
net:
        address not specified
        physical: bge111001

global# zonecfg -z web2 info net
net:
        address not specified
        physical: bge111002

global# zonecfg -z web3 info net
net:
        address not specified
        physical: bge111003

Within the zones, you could configure IP addresses 10.1.111.1/24 through 10.1.111.3/24.
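Within each exclusive-IP zone, configuring the address is the usual ifconfig work (or an /etc/hostname.bge111001 file for persistence across reboots). For web1, something like this sketch:

web1# ifconfig bge111001 plumb 10.1.111.1/24 up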

Similarly, for the authentication tier, using VLAN ID 112, you could assign the zones auth1 through auth3 to bge112001, bge112002, and bge112003, respectively. And for application servers app1 and app2 on VLAN ID 113, bge113001 and bge113002. This can be repeated until some limit is reached, whether it is network bandwidth, system resource limits, or the maximum number of concurrent VLANs on either the switch or Solaris.

This configuration could look like the following diagram.

Web Server 1, Auth Server 1, and Application Server 1 share the use of NIC1, yet are all on different VLANs (111, 112, and 113, respectively). The same for instances 2 and 3, except that there is no third application server. All traffic between the three web servers will stay within the switch, as will traffic between the authentication servers. Traffic between the tiers is passed between the IP networks by the router. NICg is showing that the global zone also has a network interface.

Using this technique, the maximum number of zones with exclusive IP Instances you could deploy on a single system that are on the same subnet is limited to the number of interfaces that are capable of doing VLAN tagging. In the above example, with three bge interfaces on the system, the maximum number of exclusive zones on a single subnet would be three. (I have intentionally reserved bge0 for the global zone, but it would be possible to use it as well, making sure the global zone uses a different VLAN ID altogether, such as 1 or 2.)
