Wednesday Jun 08, 2016

Using an https Keystore for ZFS Encryption

Overview

Recently I wrote about how to enable ZFS encryption for your home directory, in a way that accepts the wrapping key when you first log into the system. That works well for a home directory. But what about other file systems or pools that you want to encrypt and have mounted without intervention after a system reboot?

This discussion is about how to provide a wrapping key using an HTTPS service. For more details look at zfs_encrypt(1M).
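As a preview, the end result will be a file system whose keysource property points at the web server, along the lines of the command I run later in this post (the address and key path are specific to my setup):

zfs create -o encryption=on \
    -o keysource=raw,https://192.168.1.180:443/zfs-aes-256.key rpool/export/https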

As I often do, these examples use Solaris Zones. I am running Solaris 11.3 SRU 07. One Zone is the HTTPS server, and the second Zone is where I create the ZFS File Systems.

Configuring the HTTPS Service

Installing the Apache Web Server

The first step is to install the Apache web server package. Zones use the package group solaris-small-server by default, which does not include the Apache web server package.
root@myhttps:~# pkg install apache-22 
           Packages to install:  7
           Mediators to change:  3
            Services to change:  2
       Create boot environment: No
Create backup boot environment: No

DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                                7/7     1035/1035      9.7/9.7 28.1M/s

PHASE                                          ITEMS
Installing new actions                     1241/1241
Updating package state database                 Done 
Updating package cache                           0/0 
Updating image state                            Done 
Creating fast lookup database                   Done 
Updating package cache                           3/3 
root@myhttps:~# 

Configuring SSL in Apache httpd.conf

I must extend the default HTTP configuration to enable the SSL service. A good starting configuration file is in the "samples-conf.d" directory.
root@myhttps:~# cd /etc/apache2/2.2/ 
root@myhttps:/etc/apache2/2.2# ls 
conf.d          envvars         httpd.conf      magic           
mime.types      original        samples-conf.d
root@myhttps:/etc/apache2/2.2# 
I like to save the original, especially since I want to show the differences easily. I also append a few comment lines to mark where the appended "ssl.conf" content starts.
root@myhttps:/etc/apache2/2.2# cp -p httpd.conf Httpd.conf.orig 
root@myhttps:/etc/apache2/2.2# 

root@myhttps:/etc/apache2/2.2# echo "###
> ### End of Original httpd.conf
> ###
> " >> httpd.conf 
root@myhttps:/etc/apache2/2.2# 

root@myhttps:/etc/apache2/2.2# cat samples-conf.d/ssl.conf >> httpd.conf 
root@myhttps:/etc/apache2/2.2# 

root@myhttps:/etc/apache2/2.2# cp -p httpd.conf Httpd.conf.ssl.orig 
root@myhttps:/etc/apache2/2.2# 
After modifying the file for my configuration, here are the differences.
root@myhttps:/etc/apache2/2.2# diff httpd.conf Httpd.conf.ssl.orig  
47c47
< #Listen 80 
---
> Listen 80
107c107
< ServerName 192.168.1.180 
---
> ServerName 127.0.0.1
533,534c533
< #ServerName 127.0.0.1:443
< ServerName 192.168.1.180:443 
---
> ServerName 127.0.0.1:443
553,554c552
< #SSLCertificateFile "/etc/apache2/2.2/server.crt"
< SSLCertificateFile "/etc/apache2/2.2/host180.crt" 
---
> SSLCertificateFile "/etc/apache2/2.2/server.crt"
564,565c562
< #SSLCertificateKeyFile "/etc/apache2/2.2/server.key"
< SSLCertificateKeyFile "/etc/apache2/2.2/host180.key" 
---
> SSLCertificateKeyFile "/etc/apache2/2.2/server.key"
root@myhttps:/etc/apache2/2.2# 
I replaced "server." with "host180." because I want to make managing my files easier. You can leave the "server" version and update the file names below. I also turned off port 80, for http access, to prevent sending data in clear text.

Creating the Self Signed Root Certificate

The first step is to create a Root Certificate. I am putting the files into a "CA.d" directory I create, so I can easily tell the CA files apart from the web server certificate(s) created later. I am using the prefix "host180CA" to identify anything having to do with the Root Certificate.
root@myhttps:/etc/apache2/2.2# mkdir CA.d 
root@myhttps:/etc/apache2/2.2# 
root@myhttps:/etc/apache2/2.2# openssl genrsa -des3 -out CA.d/host180CA.key 2048 
Generating RSA private key, 2048 bit long modulus
.................+++
....................................+++
e is 65537 (0x10001)
Enter pass phrase for CA.d/host180CA.key: XXX 
Verifying - Enter pass phrase for CA.d/host180CA.key: XXX 
root@myhttps:/etc/apache2/2.2# 

root@myhttps:/etc/apache2/2.2# openssl req -x509 -new -nodes -key CA.d/host180CA.key \
-sha256 -days 1024 -out CA.d/host180CA.pem 
Enter pass phrase for CA.d/host180CA.key:
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) []:US 
State or Province Name (full name) []:NJ 
Locality Name (eg, city) []:MyTown 
Organization Name (eg, company) []:Oracle 
Organizational Unit Name (eg, section) []:SE 
Common Name (e.g. server FQDN or YOUR name) []:192.168.1.180 
Email Address []:steffen@steffen.steffen 
root@myhttps:/etc/apache2/2.2# 

Creating the Server Certificate

Now I create the certificate for this web server, referencing the CA.d files from above. The server certificate files have the prefix "host180" because my IP address is 192.168.1.180; again, this makes the files easier to recognize.
root@myhttps:/etc/apache2/2.2# openssl genrsa -out host180.key 2048 
Generating RSA private key, 2048 bit long modulus
........................+++
...........................................+++
e is 65537 (0x10001)
root@myhttps:/etc/apache2/2.2# 
root@myhttps:/etc/apache2/2.2# openssl req -new -key host180.key -out host180.csr 
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) []:US 
State or Province Name (full name) []:NJ 
Locality Name (eg, city) []:MyHost180 
Organization Name (eg, company) []:Oracle 
Organizational Unit Name (eg, section) []:SEweb 
Common Name (e.g. server FQDN or YOUR name) []:192.168.1.180 
Email Address []:host180@steffen.steffen 

Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:
root@myhttps:/etc/apache2/2.2# 

root@myhttps:/etc/apache2/2.2# openssl x509 -req -in host180.csr -CA CA.d/host180CA.pem \
-CAkey CA.d/host180CA.key -CAcreateserial -out host180.crt -days 1000 -sha256 
Signature ok
subject=/C=US/ST=NJ/L=MyHost180/O=Oracle/OU=SEweb/CN=192.168.1.180/emailAddress=host180@steffen.steffen
Getting CA Private Key
Enter pass phrase for CA.d/host180CA.key: XXX 
root@myhttps:/etc/apache2/2.2# 
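As a quick sanity check before wiring the certificate into Apache, the new server certificate can be verified against the self-signed CA and its contents inspected. This is a sketch using standard openssl subcommands, not part of my session above:

openssl verify -CAfile CA.d/host180CA.pem host180.crt
openssl x509 -in host180.crt -noout -subject -issuer -dates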
Here are all the files that end up getting created.
root@myhttps:/etc/apache2/2.2# ls -l 
total 398
drwxr-xr-x   2 root     root           4 Jun  8 17:33 CA.d
-rw-r--r--   1 root     root          17 Jun  8 17:39 CA.srl
drwxr-xr-x   2 root     sys            4 Jun  8 17:20 conf.d
-rw-r--r--   1 root     bin          896 Jun  8 17:20 envvars
-rw-r--r--   1 root     root        1306 Jun  8 17:39 host180.crt
-rw-r--r--   1 root     root        1058 Jun  8 17:37 host180.csr
-rw-r--r--   1 root     root        1675 Jun  8 17:36 host180.key
-rw-r--r--   1 root     bin        26114 Jun  8 17:29 httpd.conf
-rw-r--r--   1 root     bin        13673 Jun  8 17:20 Httpd.conf.orig
-rw-r--r--   1 root     bin        25975 Jun  8 17:26 Httpd.conf.ssl.orig
-rw-r--r--   1 root     bin        12958 Jun  8 17:20 magic
-rw-r--r--   1 root     bin        53011 Jun  8 17:20 mime.types
drwxr-xr-x   2 root     sys            3 Jun  8 17:20 original
drwxr-xr-x   2 root     sys           15 Jun  8 17:20 samples-conf.d
root@myhttps:/etc/apache2/2.2# ls CA.d/ 
host180CA.key  host180CA.pem
root@myhttps:/etc/apache2/2.2# 

Creating the ZFS Encryption Wrapping Key

I need a key that ZFS will use as the wrapping key. This is a quick way to generate one; you may have your own mechanism for creating keys.
root@myhttps:/etc/apache2/2.2# pktool genkey keystore=file \
outkey=/var/apache2/2.2/htdocs/zfs-aes-256.key keytype=aes keylen=256 
root@myhttps:/etc/apache2/2.2# 
root@myhttps:/etc/apache2/2.2# chmod +r /var/apache2/2.2/htdocs/zfs-aes-256.key 
root@myhttps:/etc/apache2/2.2# ls -l /var/apache2/2.2/htdocs/zfs-aes-256.key 
-r--r--r--   1 root     root          32 Jun  8 17:41 /var/apache2/2.2/htdocs/zfs-aes-256.key
root@myhttps:/etc/apache2/2.2# 
By default the key is readable only by the user that created it, in this case "root". Since Apache runs as "daemon" by default, the key will not be accessible over HTTP/HTTPS unless you make it readable by that user; here I simply make it readable by all.
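If world-readable feels too permissive, a tighter alternative (a sketch I did not use here, following the note above that Apache runs as "daemon") is to give the key file to that user and remove access for everyone else:

chown daemon:daemon /var/apache2/2.2/htdocs/zfs-aes-256.key   # daemon is the default Apache user in this setup
chmod 400 /var/apache2/2.2/htdocs/zfs-aes-256.key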

Starting the Web Server

Now that all the configuration is done, let's start it up.
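Before enabling the SMF service, a syntax check of the merged httpd.conf can save a failed start. This is a sketch, assuming the default Solaris install path for the Apache 2.2 control script:

/usr/apache2/2.2/bin/apachectl configtest   # should report "Syntax OK"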
root@myhttps:/etc/apache2/2.2# svcs *apache* 
STATE          STIME    FMRI
disabled       17:20:17 svc:/network/http:apache22
root@myhttps:/etc/apache2/2.2# 
root@myhttps:/etc/apache2/2.2# svcadm enable apache22 
root@myhttps:/etc/apache2/2.2# 
root@myhttps:/etc/apache2/2.2# svcs *apache* 
STATE          STIME    FMRI
online         17:44:01 svc:/network/http:apache22
root@myhttps:/etc/apache2/2.2# 
One final check to make sure all services are running fine.
root@myhttps:/etc/apache2/2.2# svcs -x 
root@myhttps:/etc/apache2/2.2# 

root@myhttps:/etc/apache2/2.2# netstat -anf inet 
...
TCP: IPv4
   Local Address        Remote Address     Swind  Send-Q  Rwind  Recv-Q    State
-------------------- -------------------- ------- ------ ------- ------ -----------
127.0.0.1.5999             *.*                  0      0  128000      0 LISTEN
      *.111                *.*                  0      0  128000      0 LISTEN
      *.*                  *.*                  0      0  128000      0 IDLE
      *.111                *.*                  0      0  128000      0 LISTEN
      *.*                  *.*                  0      0  128000      0 IDLE
      *.22                 *.*                  0      0  128000      0 LISTEN
      *.22                 *.*                  0      0  128000      0 LISTEN
127.0.0.1.4999             *.*                  0      0  128000      0 LISTEN
127.0.0.1.25               *.*                  0      0  128000      0 LISTEN
127.0.0.1.587              *.*                  0      0  128000      0 LISTEN
      *.*                  *.*                  0      0  128000      0 IDLE
      *.443                *.*                  0      0  128000      0 LISTEN
      *.*                  *.*                  0      0  128000      0 IDLE
root@myhttps:/etc/apache2/2.2# 
Everything looks good. On to the Zone where I will do the ZFS work.

Creating an Encrypted File System using a Keystore via HTTPS

Adding Self Signed Certificate to an HTTPS Client

Two steps are needed before the Zone can access the HTTPS service. First, I need to load the certificate for the web server into the local CA directory. I get this certificate using the "openssl" command.

Here is the complete output. To end the command, enter Ctrl-D ("^D"), which does not show up in the output.

root@ezfs:~# openssl s_client -connect 192.168.1.180:443 
CONNECTED(00000003)
depth=0 C = US, ST = NJ, L = MyHost180, O = Oracle, OU = SEweb, CN = 192.168.1.180, emailAddress = host180@steffen.steffen
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 C = US, ST = NJ, L = MyHost180, O = Oracle, OU = SEweb, CN = 192.168.1.180, emailAddress = host180@steffen.steffen
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
 0 s:/C=US/ST=NJ/L=MyHost180/O=Oracle/OU=SEweb/CN=192.168.1.180/emailAddress=host180@steffen.steffen
   i:/C=US/ST=NJ/L=MyTown/O=Oracle/OU=SE/CN=192.168.1.180/emailAddress=steffen@steffen.steffen
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIDljCCAn4CCQCfpX0OhjSiMTANBgkqhkiG9w0BAQsFADCBiTELMAkGA1UEBhMC
VVMxCzAJBgNVBAgMAk5KMQ8wDQYDVQQHDAZNeVRvd24xDzANBgNVBAoMBk9yYWNs
ZTELMAkGA1UECwwCU0UxFjAUBgNVBAMMDTE5Mi4xNjguMS4xODAxJjAkBgkqhkiG
9w0BCQEWF3N0ZWZmZW5Ac3RlZmZlbi5zdGVmZmVuMB4XDTE2MDYwODIxMzkwNloX
DTE5MDMwNTIxMzkwNlowgY8xCzAJBgNVBAYTAlVTMQswCQYDVQQIDAJOSjESMBAG
A1UEBwwJTXlIb3N0MTgwMQ8wDQYDVQQKDAZPcmFjbGUxDjAMBgNVBAsMBVNFd2Vi
MRYwFAYDVQQDDA0xOTIuMTY4LjEuMTgwMSYwJAYJKoZIhvcNAQkBFhdob3N0MTgw
QHN0ZWZmZW4uc3RlZmZlbjCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEB
AM4wzBQRsqz2lwH1sauTSnx6fpig40EBaFHRLCblvWgAYLgTY1ccV39X2zryjKIM
u0vDnBALQOu4/IMZ+7tcZNnSBPrKhqC/YgwWmY0GINvbFG0AQ1aIm/KEqeHLzVhu
EAjGc1tUvhWT1MxvooHshtR1KGzZv9gq6fnBprqOz0Es6VNWPLX3rmkryFhlE+tY
HOhEgBAMEhiQ6Ait/pORMGG5XRaLkXsXNjrEHD8YXD4VbvPl8GoMQCwSZ9M3DFA1
8IfpDaB5ByUihhFWV2NxcBBfCqCBd6v0bdh1nyAJ5zhZGmEHYztt13WqiJ315pgy
yhXnAne0SDycdRQrxpPs21ECAwEAATANBgkqhkiG9w0BAQsFAAOCAQEAY+2RTjLy
lWaaHO4xDYGuDW8k2XkxsH+BkRcDtRM0g1iliHgQLSxGqdsKr4fK4WWC7Vbfm0Cb
l47T3ny+rNvyT6ac/VhfwI/GDIOGwV+mzoVio5QlZh601gclDv5M4j8633Wr/SCc
c8ZFB6FOAfqaLDtZryfHCUbppL2AnSPY6JFQG4Cv5Uo/nTTs4vyL4JwRl/cQNLXY
6GCQRMjAwrfdjj2wBczrbEK1qzu0gD4crkB/XpyJFZq32RSvWtE3nVV9GU93ErLY
C1BxQHvrYYWVlIv59sIQ4DYec0b/mxs9HnjHVA4sveTg1CjUXRY+eYpPF7OlHa9v
2EV4l7T2IB0ZLg==
-----END CERTIFICATE----- 
subject=/C=US/ST=NJ/L=MyHost180/O=Oracle/OU=SEweb/CN=192.168.1.180/emailAddress=host180@steffen.steffen
issuer=/C=US/ST=NJ/L=MyTown/O=Oracle/OU=SE/CN=192.168.1.180/emailAddress=steffen@steffen.steffen
---
No client certificate CA names sent
---
SSL handshake has read 2055 bytes and written 463 bytes
---
New, TLSv1/SSLv3, Cipher is DHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : DHE-RSA-AES256-GCM-SHA384
    Session-ID: 552375FF9881568181BC0DCEBBD238D913DCB55381FD9A2ADED7413B00AC9078
    Session-ID-ctx: 
    Master-Key: F8D5B3E7C4FF7B8396FAEC8FAEBA0865E8790335E1A09B9703F217125C5D3EB7220D79E24F4510C35F8E500DFFC1C06D
    Key-Arg   : None
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    TLS session ticket lifetime hint: 300 (seconds)
    TLS session ticket:
    0000 - c1 35 38 cb eb 88 92 85-28 50 7e c5 cc 4f f8 4d   .58.....(P~..O.M
    0010 - 64 a7 61 7f 8f bb 09 8b-c3 b6 0b fe a4 1f 50 ce   d.a...........P.
    0020 - d5 b2 0c 82 97 9a 86 69-d2 76 ea d1 19 f3 40 fb   .......i.v....@.
    0030 - 0e 95 6b cd 9d e2 09 f5-de 52 bb 14 c7 f9 fc 6f   ..k......R.....o
    0040 - 1c 39 7f e3 3b 9a 9b 95-be 79 df 39 19 fc f3 6f   .9..;....y.9...o
    0050 - 6a 12 7a 5b b5 ea 1e 03-6f 44 01 b5 74 8b 7c 4f   j.z[....oD..t.|O
    0060 - 7a 61 8a d0 39 bb 7f 72-f1 99 81 57 57 2d b3 e1   za..9..r...WW-..
    0070 - 70 82 1b 87 33 35 95 15-62 05 07 46 bc 6f ab f1   p...35..b..F.o..
    0080 - c6 06 5a c3 4d 86 9d d0-db 2f 9a d4 70 97 98 9b   ..Z.M..../..p...
    0090 - 41 74 bb dd 03 33 7c dd-c2 20 ad bc ac c1 29 ad   At...3|.. ....).
    00a0 - de dd 72 8a 8b 32 74 10-8d 9b 45 38 f5 27 a3 d3   ..r..2t...E8.'..
    00b0 - e1 f6 d1 d6 0b 07 6e 08-cf 76 2c 7a 51 25 c6 b3   ......n..v,zQ%..

    Start Time: 1465422516
    Timeout   : 300 (sec)
    Verify return code: 21 (unable to verify the first certificate)
---
DONE
root@ezfs:~# 
I need the text between the "BEGIN CERTIFICATE" and "END CERTIFICATE" lines, including those lines. I send the output to a file and then remove everything except the certificate block.
root@ezfs:~# openssl s_client -connect 192.168.1.180:443 > /tmp/host180.pem 
depth=0 C = US, ST = NJ, L = MyHost180, O = Oracle, OU = SEweb, CN = 192.168.1.180, emailAddress = host180@steffen.steffen
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 C = US, ST = NJ, L = MyHost180, O = Oracle, OU = SEweb, CN = 192.168.1.180, emailAddress = host180@steffen.steffen
verify error:num=21:unable to verify the first certificate
verify return:1
DONE
root@ezfs:~# 

root@ezfs:~# vi /tmp/host180.pem 
root@ezfs:~# 
root@ezfs:~# cat /tmp/host180.pem 
-----BEGIN CERTIFICATE-----
MIIDljCCAn4CCQCfpX0OhjSiMTANBgkqhkiG9w0BAQsFADCBiTELMAkGA1UEBhMC
VVMxCzAJBgNVBAgMAk5KMQ8wDQYDVQQHDAZNeVRvd24xDzANBgNVBAoMBk9yYWNs
ZTELMAkGA1UECwwCU0UxFjAUBgNVBAMMDTE5Mi4xNjguMS4xODAxJjAkBgkqhkiG
9w0BCQEWF3N0ZWZmZW5Ac3RlZmZlbi5zdGVmZmVuMB4XDTE2MDYwODIxMzkwNloX
DTE5MDMwNTIxMzkwNlowgY8xCzAJBgNVBAYTAlVTMQswCQYDVQQIDAJOSjESMBAG
A1UEBwwJTXlIb3N0MTgwMQ8wDQYDVQQKDAZPcmFjbGUxDjAMBgNVBAsMBVNFd2Vi
MRYwFAYDVQQDDA0xOTIuMTY4LjEuMTgwMSYwJAYJKoZIhvcNAQkBFhdob3N0MTgw
QHN0ZWZmZW4uc3RlZmZlbjCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEB
AM4wzBQRsqz2lwH1sauTSnx6fpig40EBaFHRLCblvWgAYLgTY1ccV39X2zryjKIM
u0vDnBALQOu4/IMZ+7tcZNnSBPrKhqC/YgwWmY0GINvbFG0AQ1aIm/KEqeHLzVhu
EAjGc1tUvhWT1MxvooHshtR1KGzZv9gq6fnBprqOz0Es6VNWPLX3rmkryFhlE+tY
HOhEgBAMEhiQ6Ait/pORMGG5XRaLkXsXNjrEHD8YXD4VbvPl8GoMQCwSZ9M3DFA1
8IfpDaB5ByUihhFWV2NxcBBfCqCBd6v0bdh1nyAJ5zhZGmEHYztt13WqiJ315pgy
yhXnAne0SDycdRQrxpPs21ECAwEAATANBgkqhkiG9w0BAQsFAAOCAQEAY+2RTjLy
lWaaHO4xDYGuDW8k2XkxsH+BkRcDtRM0g1iliHgQLSxGqdsKr4fK4WWC7Vbfm0Cb
l47T3ny+rNvyT6ac/VhfwI/GDIOGwV+mzoVio5QlZh601gclDv5M4j8633Wr/SCc
c8ZFB6FOAfqaLDtZryfHCUbppL2AnSPY6JFQG4Cv5Uo/nTTs4vyL4JwRl/cQNLXY
6GCQRMjAwrfdjj2wBczrbEK1qzu0gD4crkB/XpyJFZq32RSvWtE3nVV9GU93ErLY
C1BxQHvrYYWVlIv59sIQ4DYec0b/mxs9HnjHVA4sveTg1CjUXRY+eYpPF7OlHa9v
2EV4l7T2IB0ZLg==
-----END CERTIFICATE-----
root@ezfs:~# 
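An alternative that avoids the manual edit is to pipe the s_client output through openssl x509, which keeps only the certificate block. A sketch, not what I did above:

openssl s_client -connect 192.168.1.180:443 </dev/null | \
    openssl x509 -outform PEM > /tmp/host180.pem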
I copy the file into the Certificate Authority directory.
root@ezfs:~# cp /tmp/host180.pem /etc/certs/CA/ 
root@ezfs:~# 
Because this is a self-signed certificate, I also need the CA certificate that was used to sign it. That file is on the web server.
root@ezfs:~# scp guest@192.168.1.180:/etc/apache2/2.2/CA.d/host180CA.pem /tmp 
The authenticity of host '192.168.1.180 (192.168.1.180)' can't be established.
RSA key fingerprint is 1b:62:9b:5c:42:f9:44:c9:d1:81:99:c4:e3:c0:3f:0f.
Are you sure you want to continue connecting (yes/no)? yes 
Warning: Permanently added '192.168.1.180' (RSA) to the list of known hosts.
Password: XXX 
host180CA.pem        100% |**********************************************|  1415       00:00    
root@ezfs:~# 

root@ezfs:~# cp /tmp/host180CA.pem /etc/certs/CA/ 
root@ezfs:~# 
With both files in the directory, I refresh the CA service so it picks up the new certificates.
root@ezfs:~# svcs *cert* 
STATE          STIME    FMRI
online         17:01:51 svc:/system/ca-certificates:default
root@ezfs:~# 

root@ezfs:~# svcadm refresh ca-certificates 
root@ezfs:~# 

root@ezfs:~# svcs *cert* 
STATE          STIME    FMRI
online         17:53:56 svc:/system/ca-certificates:default
root@ezfs:~# 
An easy way I found to verify that this works is the wget(1) command. Its output is also useful for understanding why the certificates are not working when they are not.
root@ezfs:~# (cd /tmp ; wget https://192.168.1.180/zfs-aes-256.key ) 
--2016-06-08 17:54:10--  https://192.168.1.180/zfs-aes-256.key
Connecting to 192.168.1.180:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32 [text/plain]
Saving to: ‘zfs-aes-256.key’

zfs-aes-256.key          100%[=================================>]      32  --.-KB/s   in 0s     

2016-06-08 17:54:10 (3.57 MB/s) - ‘zfs-aes-256.key’ saved [32/32]

root@ezfs:~# 
root@ezfs:~# rm /tmp/zfs-aes-256.key 
root@ezfs:~# 
I delete the downloaded file right away, since I only want the key accessible via HTTPS.

Create the ZFS File Systems

Now to the real task at hand: creating a ZFS file system with encryption, with the wrapping key accessed over HTTPS. I also create one that requires manual key input, to show the difference. I am using "rpool/export" as my base.
root@ezfs:~# zfs list 
NAME                              USED  AVAIL  REFER  MOUNTPOINT
rpool                            61.6M   156G   144K  /rpool
rpool/ROOT                       58.1M   156G   144K  legacy
rpool/ROOT/solaris-0             58.1M   156G  1.48G  /
rpool/ROOT/solaris-0/var         2.60M   156G   174M  /var
rpool/VARSHARE                      3M   156G  2.76M  /var/share
rpool/VARSHARE/pkg                296K   156G   152K  /var/share/pkg
rpool/VARSHARE/pkg/repositories   144K   156G   144K  /var/share/pkg/repositories
rpool/export                      360K   156G   152K  /export 
rpool/export/home                 256K   156G   152K  /export/home
rpool/export/home/guest           152K   156G   152K  /export/home/guest
root@ezfs:~# 

root@ezfs:~# zfs create -o encryption=on \
-o keysource=passphrase,prompt rpool/export/prompt 
Enter passphrase for 'rpool/export/prompt': XXX 
Enter again: XXX 
root@ezfs:~# 

root@ezfs:~# zfs create -o encryption=on \
-o keysource=raw,https://192.168.1.180:443/zfs-aes-256.key rpool/export/https 
root@ezfs:~# 
I put some data into the two file systems to test with later.
root@ezfs:~# date > /export/https/date 
root@ezfs:~# date > /export/prompt/date 
root@ezfs:~# 
root@ezfs:~# ls /export/* 
/export/home:
guest

/export/https:
date 

/export/prompt:
date 
root@ezfs:~# more /export/*/date 
::::::::::::::
/export/https/date
::::::::::::::
Wednesday, June  8, 2016 05:59:59 PM EDT
::::::::::::::
/export/prompt/date
::::::::::::::
Wednesday, June  8, 2016 06:00:07 PM EDT
root@ezfs:~# 

root@ezfs:~# zfs list 
NAME                              USED  AVAIL  REFER  MOUNTPOINT
rpool                            61.9M   156G   144K  /rpool
rpool/ROOT                       58.1M   156G   144K  legacy
rpool/ROOT/solaris-0             58.1M   156G  1.48G  /
rpool/ROOT/solaris-0/var         2.60M   156G   174M  /var
rpool/VARSHARE                      3M   156G  2.76M  /var/share
rpool/VARSHARE/pkg                296K   156G   152K  /var/share/pkg
rpool/VARSHARE/pkg/repositories   144K   156G   144K  /var/share/pkg/repositories
rpool/export                      720K   156G   168K  /export
rpool/export/home                 256K   156G   152K  /export/home
rpool/export/home/guest           152K   156G   152K  /export/home/guest
rpool/export/https                172K   156G   172K  /export/https
rpool/export/prompt               172K   156G   172K  /export/prompt 
root@ezfs:~# 
root@ezfs:~# halt 

[Connection to zone 'ezfs' pts/10 closed]

Validating Hands-Free Operation After a Reboot

The keys for encrypted ZFS file systems are only required when the file systems are first accessed. I am using Solaris Zones and don't want to reboot my system, so to simulate a reboot I "unload" the keys for all the file systems in the zone. (There is only one file system with a key here; the recursive option would handle all of them if there were more.)
root@global# zfs key -u -r pool1/zones/ezfs 
root@global# 
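To confirm the keys really are unloaded before booting the zone again, the ZFS keystatus property can be checked from the Global Zone (a sketch; encrypted datasets with an unloaded key show "unavailable"):

zfs get -r keystatus pool1/zones/ezfs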
root@global# zoneadm -z ezfs boot 
root@global# 

Test and Manually Mount the "prompt" File System

Once the zone boots, let's check what data is available.
root@global# zlogin ezfs 
[Connected to zone 'ezfs' pts/10]
Last login: Wed Jun  8 17:05:53 2016 on pts/10
Oracle Corporation      SunOS 5.11      11.3    March 2016
root@ezfs:~# 
root@ezfs:~# ls /export/* 
/export/home:
guest

/export/https:
date 

/export/prompt:
root@ezfs:~# 
root@ezfs:~# more /export/*/date 
Wednesday, June  8, 2016 05:59:59 PM EDT
root@ezfs:~# 
As you can see, only the "https" directory shows the "date" file. I manually mount the "prompt" file system.
root@ezfs:~# zfs mount rpool/export/prompt 
Enter passphrase for 'rpool/export/prompt': XXX  
root@ezfs:~# 
root@ezfs:~# more /export/*/date 
::::::::::::::
/export/https/date
::::::::::::::
Wednesday, June  8, 2016 05:59:59 PM EDT
::::::::::::::
/export/prompt/date
::::::::::::::
Wednesday, June  8, 2016 06:00:07 PM EDT
root@ezfs:~# 
Now both "date" files are available.

Summary

This was a quick and simple walk-through of the steps to automatically mount an encrypted file system without using a local keysource file. Thank you, and good luck with your ZFS experiences!

Steffen

Appreciations

Thanks to DarrenM for his repeated replies to my email requests for help, and to BartS for his quick reply as well.

Thank you to "The Data Center Overlords" for the high level steps that got me started on how to set up my own Certificate Authority and server certificates.

Revision History

(Other than minor typographical changes)

2016.06.08: Posted

2016.06.08: Created

Thursday May 05, 2016

Encrypting my Home Directory on ZFS

Overview

I like to run Solaris on my work desktops because I have all the Solaris features at my fingertips. This includes the manual pages, Solaris Zones, and Solaris networking including VNICs, and I just find the Solaris GNOME desktop the easiest for me to use for basic email, browsing, terminal windows, and the like.

Because I might be putting information on my desktop that I'd rather not leave behind when the disk drive leaves, I make an effort to encrypt my home directory. Because I do this relatively infrequently, I don't remember the steps, so I search for and reference Darren Moffat's blog. Unfortunately, it was written in 2011 when Solaris 11 11/11 delivered ZFS encryption, and it seems some files have changed slightly since then.

To make it easier for me to reference, and to add some additional features, I did repeated testing of the PAM configuration changes and am posting the steps in this blog entry. I make no effort to explain PAM, as I am not that well versed in it.

The Default Configuration

I am using one of my desktops to write this, and I will use a Solaris Zone to show what a fresh installation looks like. Darren's example shows how to enable encryption with the GNOME Display Manager (GDM). I will extend this to work with console or ssh logins. Testing the GDM configuration requires me to log out of my desktop, which is more intrusive to test and show, while testing and documenting console and ssh logins is easy with a Zone.

root@pamzone:~# cd /etc/pam.d 
root@pamzone:/etc/pam.d# 
root@pamzone:/etc/pam.d# ls 
cron           gdm-autologin  other          pfexec
cups           login          passwd         tpdlogin
root@pamzone:/etc/pam.d# 
These are the default files in a Solaris 11.3 installation using the Live Media. I highlight the files I will be changing. In addition, I will be adding a gdm file that is not there yet.

Modifying the Configuration

Because I am a bit conservative, even though this is a Zone, I make a best effort to be able to revert to the original configuration. It also lets me highlight the differences.
root@pamzone:/etc/pam.d# beadm create initial 
root@pamzone:/etc/pam.d# 
root@pamzone:/etc/pam.d# cp -p login login.orig 
root@pamzone:/etc/pam.d# cp -p other other.orig 
root@pamzone:/etc/pam.d# 
I modify the "login" and "other" files based on the changes Darren put into the "/etc/pam.conf" file. The GDM specific entries go into "gdm".
root@pamzone:/etc/pam.d# diff login login.orig 
14,16d13
< # 2016.05.04 Added for encrypting user's home directory
< #            Create a new home directory if it does not exist
< auth required         pam_zfs_key.so.1 create homes=rpool/export/home 
root@pamzone:/etc/pam.d# 
root@pamzone:/etc/pam.d# diff other other.orig 
14,16d13
< # 2016.05.04 Added for encrypting user's home directory
< #            This allows new account without coming in on console
< auth required         pam_zfs_key.so.1 create homes=rpool/export/home
49,51d45
< # 2016.05.04 Added for encrypting user's home directory
< #            Update the ZFS encryption wrapping key when the user changes their password
< password requisite    pam_zfs_key.so.1 homes=rpool/export/home 
root@pamzone:/etc/pam.d# 
root@pamzone:/etc/pam.d# cat gdm 
# 2016.05.04 Created based on https://blogs.oracle.com/darren/entry/user_user_home_directory_encryption
auth requisite          pam_authtok_get.so.1
auth required           pam_unix_cred.so.1
auth required           pam_unix_auth.so.1
# 2016.05.04 Added for encrypting user's home directory
#            Create a new home directory if it does not exist
auth required           pam_zfs_key.so.1 create homes=rpool/export/home
# 2016.05.04 End of ZFS encrytion changes
auth required           pam_unix_auth.so.1 
While Darren shows putting the ZFS encryption features into "/etc/pam.conf" I am putting them into the per-service files in "/etc/pam.d/" as the /etc/pam.conf comments recommend. This has required some testing and retesting for me to get this fully working, which is why I am creating this blog.

Testing the Configuration

The way to test this is to create a new user. Since I am doing this in a Solaris Zone I can only test text console and network logins. I will demonstrate both, and come back later to show GDM.

The first steps are to create the users and force them to enter a new password when they first log in.

root@pamzone:~# useradd -g 10 -c "user1" -d /export/home/user1 user1 
root@pamzone:~# useradd -g 10 -c "user2" -d /export/home/user2 user2 
root@pamzone:~# 
root@pamzone:~# passwd user1 
New Password: xxx
Re-enter new Password: xxx
passwd: password successfully changed for user1
root@pamzone:~# passwd user2 
New Password: xxx
Re-enter new Password: xxx
passwd: password successfully changed for user2
root@pamzone:~# 
root@pamzone:~# passwd -f user1 
passwd: password information changed for user1
root@pamzone:~# passwd -f user2 
passwd: password information changed for user2
root@pamzone:~# 
The "-f" option forces the user to enter a new password on their next login by expiring it. Thus only the user knows the password for the wrapping key.

Testing the New Users

Now I will log into the Zone's console from the Global Zone to show the console login step.
admin@global:~$ pfexec zlogin -C pamzone 
[Connected to zone 'pamzone' console]

pamzone console login: user1 
Password: xxx
Choose a new password.
New Password: xxx
Re-enter new Password: xxx
login: password successfully changed for user1
Creating home directory with encryption=on.
Your login password will be used as the wrapping key.
Oracle Corporation      SunOS 5.11      11.3    February 2016
-bash-4.1$ 
-bash-4.1$ pwd 
/export/home/user1
-bash-4.1$ /usr/sbin/zfs get encryption rpool/export/home/user1 
NAME                     PROPERTY    VALUE  SOURCE
rpool/export/home/user1  encryption  on     local 
-bash-4.1$ 
As you can see, a home directory is created automatically, and encryption is set to "on".

The second test is to log in remotely. I am simulating that by connecting to localhost, just for convenience.

root@pamzone:~# ssh user2@localhost 
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 1d:e5:ff:2d:1f:b2:db:a0:0a:ff:3b:53:db:e6:3c:68.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Password: xxx
Warning: Your password has expired, please change it now.
New Password: xxx
Re-enter new Password: xxx
sshd-kbdint: password successfully changed for user2
Creating home directory with encryption=on.
Your login password will be used as the wrapping key.
Oracle Corporation      SunOS 5.11      11.3    February 2016
-bash-4.1$ 
-bash-4.1$ pwd 
/export/home/user2 
-bash-4.1$ /usr/sbin/zfs get encryption rpool/export/home/user2 
NAME                     PROPERTY    VALUE  SOURCE
rpool/export/home/user2  encryption  on     local 
-bash-4.1$ 
Again, the ZFS encryption property validates that encryption is on.

Changing a Password

It is good to know I can have my home directory encrypted automatically when I log in the first time. What happens when it is time for me to change my password? Let's see.
-bash-4.1$ passwd 
passwd: Changing password for user2
Enter existing login password: xxx
New Password: xxx
Re-enter new Password: xxx
passwd: password successfully changed for user2
ZFS Key change for rpool/export/home/user2 successful 
-bash-4.1$ 
As you can see, the ZFS wrapping key is updated when I run the "passwd(1)" command.
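One way to double-check that the wrapping key really changed is to look at the key-related ZFS properties (a sketch; I did not capture this in the session above):

/usr/sbin/zfs get keychangedate,rekeydate rpool/export/home/user2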

Mounting an Encrypted File System/Home Directory on Reboot

The above steps created and mounted the users' home directories. Let us take a look at what happens on a reboot. The experience is different for a Zone reboot than it is for a system reboot.
root@auth:~# zfs get -r encryption rpool/export/home 
NAME                     PROPERTY    VALUE  SOURCE
rpool/export/home        encryption  off    -
rpool/export/home/guest  encryption  off    -
rpool/export/home/user1  encryption  on     local
rpool/export/home/user2  encryption  on     local
root@auth:~# 
root@auth:~# ls /export/home/*/ 
/export/home/guest/:

/export/home/user1/:
test1

/export/home/user2/:
test2
root@auth:~# 
root@auth:~# reboot 

[Connection to zone 'auth' pts/3 closed]
root@global:~# 
...
root@global:~# zlogin auth 
[Connected to zone 'auth' pts/3]
Oracle Corporation	SunOS 5.11	11.3	February 2016
root@auth:~# 
root@auth:~# ls /export/home/*/ 
/export/home/guest/:

/export/home/user1/:
test1

/export/home/user2/:
test2 
root@auth:~# 
root@auth:~# ls /export/home/*/test* 
/export/home/user1/test1  /export/home/user2/test2 
root@auth:~# 
Though the Zone was rebooted, it is not necessary to provide wrapping keys for the encrypted file systems. Now let's see what happens when the system reboots.
root@auth:~# ls /export/home/*/test* 
/export/home/*/test*: No such file or directory 
root@auth:~# 
root@auth:~# ssh user1@localhost 
Password: xxx
Oracle Corporation      SunOS 5.11      11.3    February 2016
-bash-4.1$ ls 
test1 
-bash-4.1$ exit 
logout
Connection to localhost closed.
root@auth:~# 
root@auth:~# ls /export/home/*/test* 
/export/home/user1/test1 
root@auth:~# 
root@auth:~# zfs mount rpool/export/home/user2 
Enter passphrase for 'rpool/export/home/user2': xxx
root@auth:~# 
root@auth:~# ls /export/home/*/test* 
/export/home/user1/test1  /export/home/user2/test2 
root@auth:~# 
Upon a system reboot it is necessary to provide the wrapping key. With the custom PAM setup, when user1 logs in, the key is provided and user1's home directory is mounted. A second way of providing the key is during an explicit "zfs mount" operation; in that case, the user with the privileges to run the command must know the wrapping key. I used to do that when I accessed my system remotely after a reboot, before I added the PAM entries that mount the home directory on remote access.
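A quick way to see which home directories still need a key after a system reboot is the keystatus property (a sketch along the lines of the encryption check earlier):

zfs get -r keystatus rpool/export/home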

Testing the Configuration when Logging In on a Desktop

In order to capture the desktop login experience, I need to enable remote GNOME login. I followed the steps in Setting Up Remote Desktop Access Using VNC in the Solaris 11.3 Desktop Administrator's Guide and tips on Calkins' Blog.

Because desktops need access to devices not available in a Solaris Zone, I created another user, user3, in the Global Zone on the system. (First I create a new Boot Environment and reboot into it, so I can later discard the changes to the Global Zone.)

Once all set up, I log in.

User Login

Because I force the user to enter a new password, I am prompted to do so.

Expired Password

I enter it (twice).

Enter New Password

I am told it is successful.

Password Change Successful

As with the console or a network connection, the system tells me that encryption was set up successfully.

Encryption is On with Login Password

And I can verify that by looking at the ZFS encryption attribute. It is on!

ZFS Encryption is On

So this shows the GNOME version of first login and changing the password to set the ZFS encryption wrapping key.

In Summary

Now you can encrypt your home directory and make sure the wrapping key is up to date whenever you change your password.

I will add one small item since we are changing the PAM configuration files. In Solaris 11.3, a new feature asked for by some customers reminds the user of the last login after the system is rebooted for any reason. This display disappears after ten seconds or so; to get rid of it quicker you may have to click OK. AlanC at The Observatory writes about how to disable it. Here is how I do that.

root@global:~# grep nowarn /etc/pam.d/gdm 
session required	pam_unix_session.so.1	nowarn 
root@global:~# 
Hopefully this all makes someone's life a bit easier and more secure.

Steffen

Revision History

(Other than minor typographical changes)

2016.05.10: Corrected "other" to "gdm" in how to avoid last login warnings

2016.05.06: Small changes to my comments and descriptions

2016.05.05: Posted

2016.05.03: Created

Friday Apr 22, 2016

Using Aggregations and VLANs with LDoms and Zones

Often people ask how to use link aggregations and VLANs with Oracle VM Server for SPARC (Logical Domains or LDoms). My goal here is to give a brief description and the steps to configure a link aggregation in a Service Domain (in this case also the Control Domain) and then set up different VLAN configurations.

I am showing this with Solaris 11.3, though the steps will work for any Solaris 11 release. Due to networking differences in Solaris 10, the principles still apply but the steps will be different.

My Setup

I am using a T4-1 as the system to demonstrate the networking and LDom setup, a T5120 as a remote system on the network, and a Netgear GS716T Smart Switch between the two. The GS716T can do link aggregation, but not IEEE 802.3ad LACP. Solaris supports link aggregation with or without LACP and, since Solaris 11.1, also using Data Link MultiPathing (DLMP). The functionality and steps are almost identical except for some options when setting up the link aggregation.
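For reference, the creation command differs only in a few options between the variants. This is a sketch, not from my session, since the GS716T cannot do LACP; the link names match the ones I use below:

dladm create-aggr -l net1 -l net3 aggr1             # trunk mode without LACP (what I use below)
dladm create-aggr -L active -l net1 -l net3 aggr1   # trunk mode with LACP enabled
dladm create-aggr -m dlmp -l net1 -l net3 aggr1     # DLMP mode (Solaris 11.1 and later)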

I find it useful for my understanding to see not just the command line input; I also like to see the output, and validation that the steps I perform actually do something. When doing network configurations, I prefer to see network traffic. This session includes all of that. Some of the networking requires a second system, and I will show the setup of that as well. I hope this is useful for others.

The Remote System

For my network target, as it were, I am using a SPARC T5120 running Solaris 11.2. The actual release is not as important for this, as I am using only basic VLAN features.

The initial network configuration is as follows. The system has some other things on it that I am cutting from the output, as they are not relevant to this topic.

root@remote# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net1              Ethernet             up         1000   full      e1000g1
net2              Ethernet             up         1000   full      e1000g2
net0              Ethernet             up         1000   full      e1000g0
net3              Ethernet             up         1000   full      e1000g3

root@remote# dladm show-link
LINK                CLASS     MTU    STATE    OVER
net1                phys      1500   up       --
net2                phys      1500   up       --
net0                phys      1500   up       --
net3                phys      9000   up       --

root@remote# dladm show-vlan
root@remote# 
First I will create three VLANs that are configured on the switch, 111, 112, and 113.
 
root@remote# dladm create-vlan -l net3 -v 111 net3111
root@remote# dladm create-vlan -l net3 -v 112 net3112
root@remote# dladm create-vlan -l net3 -v 113 net3113
root@remote# 
root@remote# dladm show-link
LINK                CLASS     MTU    STATE    OVER
net1                phys      1500   up       --
net2                phys      1500   up       --
net0                phys      1500   up       --
net3                phys      9000   up       --
net3111             vlan      9000   up       net3
net3112             vlan      9000   up       net3
net3113             vlan      9000   up       net3
root@remote# 
root@remote# dladm show-vlan
LINK                VID  SVID PVLAN-TYPE  FLAGS  OVER
net3111             111  --   --          -----  net3
net3112             112  --   --          -----  net3
net3113             113  --   --          -----  net3
root@remote# 
If I had not set the data link name (in my case net3111 for the first one), Solaris would have used the old PPA (Physical Point of Attachment) naming scheme that has been used in Solaris for a long time. The names would have been net111003, net112003, and net113003, which require more typing. I like names where it is easy to recognize both the data link the VLAN is on and the VLAN ID.
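For comparison, this is what the alternative would look like; omitting the VLAN link name lets Solaris generate the PPA-style name (a sketch, not something I ran):

dladm create-vlan -l net3 -v 111            # auto-named net111003 (old PPA style)
dladm create-vlan -l net3 -v 111 net3111    # explicit, easier-to-recognize name (what I did above)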

The next step is to put IP addresses on those VLANs. I use 192.168.VLAN.x as my subnets, and I set "x" to the host part of the IP address of the base system, in this case "1".

root@remote# ipadm create-ip net3111
root@remote# ipadm create-ip net3112
root@remote# ipadm create-ip net3113
root@remote# 
root@remote# ipadm create-addr -a 192.168.111.1/24 net3111
net3111/v4
root@remote# ipadm create-addr -a 192.168.112.1/24 net3112
net3112/v4
root@remote# ipadm create-addr -a 192.168.113.1/24 net3113
net3113/v4
root@remote# 

root@remote# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
net0/v4           static   ok           172.16.1.1/22
net3111/v4        static   ok           192.168.111.1/24
net3112/v4        static   ok           192.168.112.1/24
net3113/v4        static   ok           192.168.113.1/24
lo0/v6            static   ok           ::1/128
net0/v6           addrconf ok           fe80::214:4fff:feac:57c4/10
net0/v6           addrconf ok           2606:b400:602:c080:214:4fff:feac:57c4/64
root@remote# 
The remote system setup is complete.

Creating a Link Aggregation

On the SPARC T4-1 running Solaris 11.3, I will first create an aggregation and test it in the Control/Service Domain. I use interfaces 1 and 3 on the system, since those use two different physical chips on the system motherboard. In production, they would likely be ports on two different NICs.
root@cdom# dladm create-aggr -l net1 -l net3 aggr1
root@cdom# 
root@cdom# dladm show-aggr
LINK              MODE  POLICY   ADDRPOLICY           LACPACTIVITY LACPTIMER
aggr1             trunk L4       auto                 off          short
root@cdom# 
root@cdom# dladm show-aggr -P
LINK              MODE  POLICY   ADDRPOLICY           LACPACTIVITY LACPTIMER
aggr1             trunk L4       auto                 off          short
root@cdom# dladm show-aggr -L
LINK                PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr1               net1         no           no   no   no   no        no
--                  net3         no           no   no   no   no        no
root@cdom# dladm show-aggr -x
LINK       PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr1      --             1000Mb full    up        0:21:28:d2:17:f9   --
           net1           1000Mb full    up        0:21:28:d2:17:f9   attached
           net3           1000Mb full    up        0:21:28:d2:17:fb   attached
root@cdom# 
I show different outputs of the dladm(1M) command here, and we'll see some differences later.

The aggregation is on a private network on the Netgear switch, so a snoop will not show a lot of traffic. I will generate some traffic using ping, and I will be switching between the two systems for that.

root@remote# ping 192.168.111.101 2
no answer from 192.168.111.101
root@remote# ping 192.168.112.101 2
no answer from 192.168.112.101
root@remote# ping 192.168.113.101 2
no answer from 192.168.113.101
root@remote# 
To keep output short, and make testing faster, I only sent two packets per IP address, since I know there is not going to be an answer. So what does this look like on the system with the aggregation?
root@cdom# snoop -d aggr1
Using device aggr1 (promiscuous mode)
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.101, 192.168.111.101 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.101, 192.168.111.101 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.1, 192.168.111.1 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.101, 192.168.111.101 ?
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.1, 192.168.112.1 ?
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.101, 192.168.112.101 ?
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.101, 192.168.112.101 ?
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.101, 192.168.112.101 ?
VLAN#113: 192.168.113.1 -> (broadcast)  ARP C Who is 192.168.113.1, 192.168.113.1 ?
VLAN#113: 192.168.113.1 -> (broadcast)  ARP C Who is 192.168.113.101, 192.168.113.101 ?
VLAN#113: 192.168.113.1 -> (broadcast)  ARP C Who is 192.168.113.101, 192.168.113.101 ?
VLAN#113: 192.168.113.1 -> (broadcast)  ARP C Who is 192.168.113.101, 192.168.113.101 ?
First, notice that each line includes the VLAN ID. This is a new feature in Solaris 11, and it may have been backported to a late update of Solaris 10 (I will have to check and come back on that).

You can see the ARP requests for all three VLANs, with the target address on each. This is why I like to have the VLAN ID match the subnet; I am beginning to see this practice with some customers as well.

My first test is to bring down one or both ports and see the changes in the aggregation and the network. After turning off the switch port for net1, this is how things look.

root@cdom# dladm show-aggr -x
LINK       PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr1      --             1000Mb full    up        0:21:28:d2:17:f9   --
           net1           0Mb  unknown   down      0:21:28:d2:17:f9   standby
           net3           1000Mb full    up        0:21:28:d2:17:fb   attached
root@cdom# 
root@cdom# snoop -d aggr1
Using device aggr1 (promiscuous mode)
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.101, 192.168.111.101 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.101, 192.168.111.101 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.101, 192.168.111.101 ?
^Croot@cdom# 
The aggregation stays up, and traffic continues to come into the system. Port status changes are also logged in /var/adm/messages; I don't see anything going to the console, however. Messages are limited because the aggregation is not plumbed and IP is not using it, even when both ports are down.
root@cdom# dladm show-aggr -x
LINK       PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr1      --             0Mb  unknown   down      0:21:28:d2:17:f9   --
           net1           0Mb  unknown   down      0:21:28:d2:17:f9   standby
           net3           0Mb  unknown   down      0:21:28:d2:17:fb   standby
root@cdom# 

root@cdom# tail /var/adm/messages
...
Apr 21 11:00:07 gravity mac: [ID 486395 kern.info] NOTICE: igb3 link down
Apr 21 11:05:40 gravity mac: [ID 486395 kern.info] NOTICE: igb1 link down
Apr 21 11:05:40 gravity mac: [ID 486395 kern.info] NOTICE: aggr1 link down
root@cdom# 
To better see network connectivity, I will create a VLAN and configure an IP address.
root@cdom# dladm create-vlan -l aggr1 -v 111 aggr1111
root@cdom# ipadm create-ip aggr1111
root@cdom# ipadm create-addr -a 192.168.111.5/24 aggr1111
aggr1111/v4
root@cdom# 
root@cdom# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
net0/v4           static   ok           192.168.1.5/22
net4/v4           static   ok           169.254.182.77/24
aggr1111/v4       static   ok           192.168.111.5/24
lo0/v6            static   ok           ::1/128
net0/v6           addrconf ok           fe80::221:28ff:fed2:17f8/10
net0/v6           addrconf ok           2606:b400:602:c080:221:28ff:fed2:17f8/64
root@cdom# 
The network works, as seen from the remote system.
root@remote# ping 192.168.111.5 2
192.168.111.5 is alive
root@remote# 
I will first bring one port down, then both.
root@cdom# dladm show-aggr -x
LINK       PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr1      --             1000Mb full    up        0:21:28:d2:17:f9   --
           net1           1000Mb full    up        0:21:28:d2:17:f9   attached
           net3           1000Mb full    up        0:21:28:d2:17:fb   attached
root@cdom# 

root@cdom# dladm show-aggr -x
LINK       PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr1      --             1000Mb full    up        0:21:28:d2:17:f9   --
           net1           0Mb  unknown   down      0:21:28:d2:17:f9   standby
           net3           1000Mb full    up        0:21:28:d2:17:fb   attached
root@cdom# 
root@remote# ping 192.168.111.5 2
192.168.111.5 is alive
root@remote# 
root@cdom# dladm show-aggr -x
LINK       PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr1      --             0Mb  unknown   down      0:21:28:d2:17:f9   --
           net1           0Mb  unknown   down      0:21:28:d2:17:f9   standby
           net3           0Mb  unknown   down      0:21:28:d2:17:fb   standby
root@cdom# 
root@cdom# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
net0/v4           static   ok           192.168.1.5/22
net4/v4           static   ok           169.254.182.77/24
aggr1111/v4       static   inaccessible 192.168.111.5/24
lo0/v6            static   ok           ::1/128
net0/v6           addrconf ok           fe80::221:28ff:fed2:17f8/10
net0/v6           addrconf ok           2606:b400:602:c080:221:28ff:fed2:17f8/64
root@cdom# 
root@remote# ping 192.168.111.5 2
no answer from 192.168.111.5
root@remote# 
With this I have created an aggregation, shown a VLAN on top of it, and demonstrated what happens when one or both ports in the aggregation fail. The aggregation remains functional with one port down and networking continues. With both ports down the aggregation fails, and the IP address is shown as inaccessible.

Setting up the Virtual Switch in a Service Domain

I use the terms Service Domain and Control Domain to refer to the specific function I am working with at the time. On this system, there is only one Service Domain, and it is also the Control Domain. The concepts and steps I am outlining here also apply to second or redundant Service Domains when a system is configured with more than one.

This is an area where there are differences between Solaris 11 and Solaris 10 Service Domains. Oracle highly recommends that all Service Domains are running Solaris 11.

I will be creating a virtual switch on top of the aggr1 data link while the existing VLANs are still there. If this were Solaris 10, I would likely remove them, and if I needed Service Domain access to the aggregation, I would go through the virtual switch instead.

In Solaris 11, there is no need to set or modify the pvid and vid parameters on the virtual switch. If this were Solaris 10 and I wanted to get access to VLANs on the data link (in this case aggr1) I would need to set those.
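For a Solaris 10 Service Domain, the VLAN IDs would go onto the virtual switch itself, roughly like this (a sketch of the ldm syntax, not something I run here):

ldm add-vsw net-dev=aggr1 pvid=1 vid=111,112,113 primary-vsw1 primary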

Let us get started on the virtual switch.

root@cdom# ldm add-vsw net-dev=aggr1 primary-vsw1 primary
root@cdom# 
root@cdom# ldm list-services
...
VSW
    NAME             LDOM             MAC               NET-DEV   ID   DEVICE     LINKPROP   DEFAULT-VLAN-ID PVID VID                  MTU   MODE   INTER-VNET-LINK
    primary-vsw0     primary          00:14:4f:f9:b4:9f net0      0    switch@0              1               1                         1500         on         
    primary-vsw1     primary          00:14:4f:f8:44:87 aggr1     1    switch@1              1               1                         1500         on         

...
root@cdom# 

root@cdom# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-cv-  UART    16    7680M    0.2%  0.2%  69d 17h 38m
host1            active     -n----  5000    8     4G       0.2%  0.2%  5d 21h 23m
root@cdom# 

There is an existing Guest Domain on the system, and I will use it to demonstrate the networking. I will do this in steps, to cover a range of LDom networking items. The first item is to create a new virtual network device (vnet) for the Guest Domain. I show the before and after.
guest@host1:~$ pfbash [1]
guest@host1:~$ 
guest@host1:~$ PS1="guest-pf@host1$ "
guest-pf@host1$

guest-pf@host1$ dladm show-phys        
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
guest-pf@host1$ 
guest-pf@host1$ dladm show-link
LINK                CLASS     MTU    STATE    OVER
net0                phys      1500   up       --
guest-pf@host1$ 
root@cdom# ldm add-vnet linkprop=phys-state vnet1 primary-vsw1 host1
root@cdom# 
guest-pf@host1$ dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             unknown    0      unknown   vnet1
guest-pf@host1$ 
In the background, on the remote system, I have a steady ping(1) running on all three subnets. A snoop shows no traffic; I won't bother to show "nothing" here (how do I add a smiley?). So we have successfully added a virtualized network interface where the underlying data link is an aggregation.

[1] You may wonder what I did here. If you don't, skip to the next section.

I am using Solaris' Role Based Access Control feature, where this user has been given a lot of privileges. I could just do an su(1) to root. Instead, I am running as the user in the profile-shell version of bash, so every command is checked for authorization. This is easier than prefixing each command with "pfexec" (for those familiar with sudo, like prefixing with "sudo"). The pfexec(1) command does not prompt for a password.

Starting to Work with VLANs

I keep the snoop running while I add a VLAN ID to the vnet, as a "vid", which means it will also show up in the Guest Domain with the VLAN tag.
root@cdom# ldm set-vnet vid=111 vnet1 host1
root@cdom# 
guest-pf@host1$ snoop -d net1
Using device net1 (promiscuous mode)
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
^Cguest-pf@host1$ 
It is not easy to show here, but almost immediately after I run the set-vnet command, snoop sees traffic on VLAN 111, just as expected. The pings/ARPs on the other two VLANs are still not coming through, as those VLANs are not assigned to the vnet. Now I will add VLAN 112 also as a vid, and VLAN 113 as a pvid, which means the 113 traffic will come in with the VLAN tag removed.
root@cdom# ldm set-vnet vid=111,112 pvid=113 vnet1 host1
root@cdom# 
guest-pf@host1$ snoop -d net1
Using device net1 (promiscuous mode)
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.6, 192.168.112.6 ?
192.168.113.1 -> (broadcast)  ARP C Who is 192.168.113.6, 192.168.113.6 ?
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.6, 192.168.112.6 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
192.168.113.1 -> (broadcast)  ARP C Who is 192.168.113.6, 192.168.113.6 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.6, 192.168.112.6 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
192.168.113.1 -> (broadcast)  ARP C Who is 192.168.113.6, 192.168.113.6 ?
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.6, 192.168.112.6 ?
^Cguest-pf@host1$ 
As I expected, the VLAN 111 traffic continues, and now I also see VLAN 112 tagged (the VLAN#112: prefix) and the VLAN 113 traffic untagged. This is easy to see because of the third octet in each IP address.

Note that I can add or remove VLANs while the interface is running in the Guest Domain.

Testing Link Failure with LDoms

The next step is to show how a failure of one or both ports in the aggregation affects the Guest Domain. I add an IP address in the Guest, and then turn off one port, then both.
guest-pf@host1$ dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             down       0      unknown   vnet1
guest-pf@host1$ 

guest-pf@host1$ ipadm create-ip net1
guest-pf@host1$ ipadm create-addr -a 192.168.113.6/24 net1
net1/v4
guest-pf@host1$ 

guest-pf@host1$ ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
net0/v4           static   ok           192.168.1.6/22
net1/v4           static   ok           192.168.113.6/24
lo0/v6            static   ok           ::1/128
net0/v6           addrconf ok           fe80::214:4fff:fef9:fc75/10
net0/v6           addrconf ok           2606:b400:602:c080:214:4fff:fef9:fc75/64
guest-pf@host1$ 

guest-pf@host1$ dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             up         0      unknown   vnet1
guest-pf@host1$ 

guest-pf@host1$ snoop -d net1
Using device net1 (promiscuous mode)
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.6, 192.168.112.6 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
192.168.113.1 -> 192.168.113.6 ICMP Echo request (ID: 26169 Sequence number: 8405)
192.168.113.6 -> 192.168.113.1 ICMP Echo reply (ID: 26169 Sequence number: 8405)
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.6, 192.168.112.6 ?
VLAN#111: 192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
192.168.113.1 -> 192.168.113.6 ICMP Echo request (ID: 26169 Sequence number: 8406)
192.168.113.6 -> 192.168.113.1 ICMP Echo reply (ID: 26169 Sequence number: 8406)
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.6, 192.168.112.6 ?
^Cguest-pf@host1$ 
guest-pf@host1$ 
After configuring the address, I can see ICMP echo and reply messages to that address. I chose to do this on VLAN 113, which is untagged; however, the same would work if I created a VLAN data link. I will show that in a later step.
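
As a preview, a VLAN data link on top of net1 would be a minimal sketch like this (the link name vlan112 and the address are illustrative, matching the VLAN 112 subnet used in this setup):
guest-pf@host1$ dladm create-vlan -l net1 -v 112 vlan112
guest-pf@host1$ ipadm create-ip vlan112
guest-pf@host1$ ipadm create-addr -a 192.168.112.6/24 vlan112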

I quickly tested the configuration with one port on the switch down, and in the guest it looks the same. Then I marked the second port down. This takes the whole aggregation down.

root@cdom# dladm show-aggr -x
LINK       PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr1      --             0Mb  unknown   down      0:21:28:d2:17:f9   --
           net1           0Mb  unknown   down      0:21:28:d2:17:f9   standby
           net3           0Mb  unknown   down      0:21:28:d2:17:fb   standby
root@cdom# 
So how does this look in the Guest Domain? You can see that it also sees the virtual network interface down. How did that happen? I'll explain shortly.
guest-pf@host1$ dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             down       0      unknown   vnet1
guest-pf@host1$ 
guest-pf@host1$ ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
net0/v4           static   ok           192.168.1.6/22
net1/v4           static   inaccessible 192.168.113.6/24
lo0/v6            static   ok           ::1/128
net0/v6           addrconf ok           fe80::214:4fff:fef9:fc75/10
net0/v6           addrconf ok           2606:b400:602:c080:214:4fff:fef9:fc75/64
guest-pf@host1$ 
guest-pf@host1$ snoop -d net1
Using device net1 (promiscuous mode)
^Cguest-pf@host1$ 
guest-pf@host1$ 
You may have noticed that when I set up the vnet in the Service Domain, I used the option "linkprop=phys-state". This LDom option uses an out-of-band protocol to pass the link state of the underlying data link to the guest. Without it, because there is a virtual switch between the physical data link or aggregation and the virtual network interface (vnet), the latter would not see a hardware failure. It can still communicate with other vnets on the same virtual switch. This link state propagation was added to LDoms a number of years ago.

To demonstrate, I will turn linkprop off, and then look at the interface in the Guest Domain.

root@cdom# ldm set-vnet linkprop="" vnet1 host1
root@cdom# 
guest-pf@host1$ dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             up         0      unknown   vnet1
guest-pf@host1$ 
guest-pf@host1$ ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
net0/v4           static   ok           192.168.1.6/22
net1/v4           static   ok           192.168.113.6/24
lo0/v6            static   ok           ::1/128
net0/v6           addrconf ok           fe80::214:4fff:fef9:fc75/10
net0/v6           addrconf ok           2606:b400:602:c080:214:4fff:fef9:fc75/64
guest-pf@host1$ 
The Guest thinks the link is working. A snoop was completely quiet. I'll turn linkprop back on, and then enable the ports again to put everything into a working state. Behind the scenes I see my ping succeeding on the remote system, another validation that the network is working again.
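
Turning the propagation back on is just the reverse of the earlier command; a sketch, run in the Control Domain:
root@cdom# ldm set-vnet linkprop=phys-state vnet1 host1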

Using Solaris Virtual Network Interfaces (VNICs) in LDoms

Most customers using LDoms are also using Solaris Zones. A key feature in Solaris 11 is network virtualization. It allows a user, or the Solaris Zones framework, to create individual virtual NICs (VNICs) for Zones, making consolidation much easier and letting the Zones behave more as if they were separate systems with their own networking hardware. Before moving on to Zones, I'd like to test this with a VNIC manually.

Let's give it a try.

guest-pf@host1$ dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             up         0      unknown   vnet1
guest-pf@host1$ 
guest-pf@host1$ dladm show-phys -m
LINK                SLOT     ADDRESS            INUSE CLIENT
net0                primary  0:14:4f:f9:fc:75   yes   net0
                    1        0:14:4f:fb:a1:78   no    --
                    2        0:14:4f:f8:f9:32   no    --
                    3        0:14:4f:f9:ab:37   no    --
                    4        0:14:4f:f8:1:93    no    --
net1                primary  0:14:4f:f8:3e:e5   yes   net1
guest-pf@host1$ 
guest-pf@host1$ dladm create-vnic -l net1 vnic11
dladm: vnic creation failed: operation not supported
guest-pf@host1$ 
guest-pf@host1$ dladm create-vnic -l net0 vnic1
guest-pf@host1$ 
guest-pf@host1$ dladm show-phys -m
LINK                SLOT     ADDRESS            INUSE CLIENT
net0                primary  0:14:4f:f9:fc:75   yes   net0
                    1        0:14:4f:fb:a1:78   yes   vnic1
                    2        0:14:4f:f8:f9:32   no    --
                    3        0:14:4f:f9:ab:37   no    --
                    4        0:14:4f:f8:1:93    no    --
net1                primary  0:14:4f:f8:3e:e5   yes   net1
guest-pf@host1$ 
Oops. Creating a VNIC on net1 failed. Why is that? It turns out each VNIC needs its own MAC address, since it will have its own IP address on it; this is definitely the case in a Zone. However, the underlying "physical" interface, in this case a vnet, only has one MAC address. And while on an actual physical interface it is possible to add more MAC addresses through some device driver mechanics, this is not possible on a vnet.

This is also why I chose to show VNICs outside of Zones. If we had gone straight to Zone creation and start-up, this failure might have been harder to track down.

Several years ago LDoms added a feature to assign additional MAC addresses to a vnet. The property is called "alt-mac-addrs". It allows a fixed number of MAC addresses to be assigned to the vnet. Unfortunately, this vnet property cannot be set or changed while a Guest Domain is running, so I will shut the Guest down.

guest-pf@host1$ init 5
updating /platform/sun4v/boot_archive
guest-pf@host1$ 
root@cdom# ldm set-vnet alt-mac-addrs=auto,auto,auto,auto,auto,auto vnet1 host1
Please perform the operation while the LDom is bound or inactive
root@cdom# 
root@cdom# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-cv-  UART    16    7680M    1.3%  1.3%  69d 22h 15m
host1            active     -n----  5000    8     4G       0.1%  0.1%  3h 54m
root@cdom# 

root@cdom# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-cv-  UART    16    7680M    0.5%  0.5%  69d 22h 16m
host1            bound      ------  5000    8     4G                   
root@cdom# 

root@cdom# ldm set-vnet alt-mac-addrs=auto,auto,auto,auto,auto,auto vnet1 host1
root@cdom# 
root@cdom# ldm start host1
LDom host1 started
root@cdom# 
I show the error message from when I tried to change the vnet while the Guest Domain was running. Once it was stopped, the operation was successful. You may notice that I list the word "auto" six times: I am adding six MAC addresses to the vnet, and I am allowing each MAC address to be generated automatically. If I need to keep MAC addresses across configurations, I can set them explicitly.
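
If I ever need to pin the addresses down, the explicit form would look roughly like this (a sketch; the MAC values are just examples, and the Guest must again be bound or inactive):
root@cdom# ldm set-vnet alt-mac-addrs=0:14:4f:fa:a6:5e,0:14:4f:f8:92:c0 vnet1 host1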

Once the Guest Domain is back up, I can see what things look like now.

guest-pf@host1$ dladm show-phys -m
LINK                SLOT     ADDRESS            INUSE CLIENT
net0                primary  0:14:4f:f9:fc:75   yes   net0
                    1        0:14:4f:fb:a1:78   yes   vnic1
                    2        0:14:4f:f8:f9:32   no    --
                    3        0:14:4f:f9:ab:37   no    --
                    4        0:14:4f:f8:1:93    no    --
net1                primary  0:14:4f:f8:3e:e5   yes   net1
                    1        0:14:4f:fa:a6:5e   no    --
                    2        0:14:4f:f8:92:c0   no    --
                    3        0:14:4f:f9:77:8c   no    --
                    4        0:14:4f:fb:d8:33   no    --
                    5        0:14:4f:f8:50:1    no    --
                    6        0:14:4f:fa:bc:2d   no    --
guest-pf@host1$ 
Here you see the six MAC addresses on the second interface. That is one reason I chose a number other than my typical four MACs. This time the operation to create a VNIC on net1 should succeed.
guest-pf@host1$ dladm create-vnic -l net1 vnic11
guest-pf@host1$ 
guest-pf@host1$ dladm show-phys -m
LINK                SLOT     ADDRESS            INUSE CLIENT
net0                primary  0:14:4f:f9:fc:75   yes   net0
                    1        0:14:4f:fb:a1:78   yes   vnic1
                    2        0:14:4f:f8:f9:32   no    --
                    3        0:14:4f:f9:ab:37   no    --
                    4        0:14:4f:f8:1:93    no    --
net1                primary  0:14:4f:f8:3e:e5   yes   net1
                    1        0:14:4f:fa:a6:5e   yes   vnic11
                    2        0:14:4f:f8:92:c0   no    --
                    3        0:14:4f:f9:77:8c   no    --
                    4        0:14:4f:fb:d8:33   no    --
                    5        0:14:4f:f8:50:1    no    --
                    6        0:14:4f:fa:bc:2d   no    --
guest-pf@host1$ 
guest-pf@host1$ dladm show-vnic
LINK            OVER           SPEED  MACADDRESS        MACADDRTYPE IDS
vnic1           net0           0      0:14:4f:fb:a1:78  factory, slot 1 VID:0
vnic11          net1           0      0:14:4f:fa:a6:5e  factory, slot 1 VID:0
guest-pf@host1$ 
Success indeed. I will get rid of the VNIC on net0 to simplify output.
guest-pf@host1$ dladm delete-vnic vnic1
guest-pf@host1$ 
guest-pf@host1$ dladm show-vnic
LINK            OVER           SPEED  MACADDRESS        MACADDRTYPE IDS
vnic11          net1           0      0:14:4f:fa:a6:5e  factory, slot 1 VID:0
guest-pf@host1$ 
Before moving on to Zones, I want to show two more things: creating an interface on a VLAN, and showing that a full aggregation failure also propagates to the VNIC. There are two types of operations here, one on VLANs and one on VNICs. Since I can specify a VLAN ID when creating a VNIC, I can show both in a single operation.
guest-pf@host1$ dladm create-vnic -l net1 -v 111 vnic1111
guest-pf@host1$ 
guest-pf@host1$ dladm show-vnic
LINK            OVER           SPEED  MACADDRESS        MACADDRTYPE IDS
vnic11          net1           0      0:14:4f:fa:a6:5e  factory, slot 1 VID:0
vnic1111        net1           0      0:14:4f:f8:92:c0  factory, slot 2 VID:111
guest-pf@host1$ dladm show-vlan
guest-pf@host1$ 
guest-pf@host1$ ipadm create-ip vnic1111
guest-pf@host1$ ipadm create-addr -a 192.168.111.6/24 vnic1111
vnic1111/v4
guest-pf@host1$ 

guest-pf@host1$ snoop -d net1
Using device net1 (promiscuous mode)
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.6, 192.168.112.6 ?
VLAN#111: 192.168.111.1 -> 192.168.111.6 ICMP Echo request (ID: 26167 Sequence number: 13612)
VLAN#111: 192.168.111.6 -> 192.168.111.1 ICMP Echo reply (ID: 26167 Sequence number: 13612)
192.168.113.1 -> 192.168.113.6 ICMP Echo request (ID: 26169 Sequence number: 13601)
192.168.113.6 -> 192.168.113.1 ICMP Echo reply (ID: 26169 Sequence number: 13601)
VLAN#111: 192.168.111.6 -> (broadcast)  ARP C Who is 192.168.111.6, 192.168.111.6 ?
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.6, 192.168.112.6 ?
VLAN#111: 192.168.111.1 -> 192.168.111.6 ICMP Echo request (ID: 26167 Sequence number: 13613)
VLAN#111: 192.168.111.6 -> 192.168.111.1 ICMP Echo reply (ID: 26167 Sequence number: 13613)
192.168.113.1 -> 192.168.113.6 ICMP Echo request (ID: 26169 Sequence number: 13602)
192.168.113.6 -> 192.168.113.1 ICMP Echo reply (ID: 26169 Sequence number: 13602)
VLAN#112: 192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.6, 192.168.112.6 ?
^Cguest-pf@host1$ 
Here I created a VNIC on top of net1 with VLAN ID 111. I can see those details with dladm(1M).

And snoop now shows that pings are working to both 192.168.113.6 and 192.168.111.6. Next I will disable both interfaces on the switch.

guest-pf@host1$ dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             down       0      unknown   vnet1
guest-pf@host1$ 
guest-pf@host1$ dladm show-link
LINK                CLASS     MTU    STATE    OVER
net0                phys      1500   up       --
net1                phys      1500   up       --
vnic11              vnic      1500   up       net1
vnic1111            vnic      1500   down     net1
guest-pf@host1$ 
guest-pf@host1$ ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
net0/v4           static   ok           192.168.1.6/22
net1/v4           static   ok           192.168.113.6/24
vnic1111/v4       static   inaccessible 192.168.111.6/24
lo0/v6            static   ok           ::1/128
net0/v6           addrconf ok           fe80::214:4fff:fef9:fc75/10
net0/v6           addrconf ok           2606:b400:602:c080:214:4fff:fef9:fc75/64
guest-pf@host1$ 
guest-pf@host1$ snoop -d net1
Using device net1 (promiscuous mode)
^Cguest-pf@host1$ 
I am a bit stumped. A VNIC on net1 is showing it is down, however, the base interface is not. I see this both at the data link layer with dladm and at the IP layer with ipadm. I thought this might be a bug; however, Solaris network engineering says this is expected behavior when only one VNIC is up. The VNICs can still be used to communicate with each other, even though the underlying data link is down, just as hosts on any switch whose uplink is down can still communicate with each other.

Note: I may come back to this later and update details.

Let us move on to Zones.

Using the LDoms and Solaris Zones Network Virtualization Features Together

Now I would like to combine all the features into creating a Zone. The Link Aggregation is being handled by the Service Domain. This is really convenient, as all LDoms and Zones will benefit from the increased availability of the aggregation. And since each VNIC has its own MAC address, inbound traffic that is hashed at Layer 2 may still have its load spread across the member links in the aggregation. Solaris' own load spreading is at L4, using TCP or UDP headers, so outbound traffic is already likely to spread.
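
If you want to check or adjust the outbound spreading policy of the aggregation itself, that happens in the Service Domain; a sketch (aggr1 is the aggregation shown earlier):
root@cdom# dladm show-aggr
root@cdom# dladm modify-aggr -P L4 aggr1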

I will not focus on the mechanics of creating a Solaris Zone here. Others and I have done that elsewhere. However, the network details of the Zone configuration are important to highlight.

guest-pf@host1$ zonecfg -z myzone info anet
anet:
	linkname: net0
	lower-link: net1
	allowed-address not specified
	configure-allowed-address: true
	defrouter not specified
	allowed-dhcp-cids not specified
	link-protection: mac-nospoof
	mac-address: auto
	mac-prefix not specified
	mac-slot not specified
	vlan-id not specified
	priority not specified
	rxrings not specified
	txrings not specified
	mtu not specified
	maxbw not specified
	bwshare not specified
	rxfanout not specified
	vsi-typeid not specified
	vsi-vers not specified
	vsi-mgrid not specified
	etsbw-lcl not specified
	cos not specified
	pkey not specified
	linkmode not specified
	evs not specified
	vport not specified
anet:
	linkname: net1
	lower-link: net1
...
	vlan-id: 111
...
anet:
	linkname: net2
	lower-link: net1
...
	vlan-id: 112
...
guest-pf@host1$ 
Each network section starts with "anet", for automatic network. This Solaris 11 feature creates a VNIC for each entry when the Zone boots, and removes it when the Zone halts. This simplifies Zone networking and limits the privileges an administrator needs to those for Zone configuration. The user "guest" has those privileges.

The link "net0" had the defaults, and is using the net1 interface. Since "vlan-id" is not specified, it will use the untagged inteface, or VLAN 113.

The other two interfaces, net1 and net2 will use VLAN IDs 111 and 112, respectively.
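
For reference, the zonecfg input that produces an anet entry like net1 above would look roughly like this (a sketch matching the info output, not my actual configuration file):
add anet
set linkname=net1
set lower-link=net1
set vlan-id=111
end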

Because I did not give guest all Zone privileges, I perform a few operations here as root. User guest can start and stop Zones, and also log into the Zone.

guest-pf@host1$ su
Password: 
root@host1:~# 
root@host1:~# zonecfg -z myzone -f myzone.cfg 
UX: /usr/sbin/usermod: guest is currently logged in, some changes may not take effect until next login.
root@host1:~# 
root@host1:~# zoneadm -z myzone install -c myzone.xml 
The following ZFS file system(s) have been created:
    rpool/zones
    rpool/zones/myzone
Progress being logged to /var/log/zones/zoneadm.20160421T225323Z.myzone.install
       Image: Preparing at /zones/myzone/root.

 Install Log: /system/volatile/install.1585/install_log
 AI Manifest: /tmp/manifest.xml.P6aOed
  SC Profile: /export/home/guest/myzone.xml
    Zonename: myzone
Installation: Starting ...

        Creating IPS image
Startup linked: 1/1 done
        Installing packages from:
            solaris
                origin:  http://172.16.1.1/
DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                            279/279   48306/48306  354.4/354.4  1.6M/s

PHASE                                          ITEMS
Installing new actions                   66017/66017
Updating package state database                 Done 
Updating package cache                           0/0 
Updating image state                            Done 
Creating fast lookup database                   Done 
Updating package cache                           1/1 
Installation: Succeeded

        Note: Man pages can be obtained by installing pkg:/system/manual

 done.

        Done: Installation completed in 431.151 seconds.


  Next Steps: Boot the zone, then log into the zone console (zlogin -C)

              to complete the configuration process.

Log saved in non-global zone as /zones/myzone/root/var/log/zones/zoneadm.20160421T225323Z.myzone.install
root@host1:~#
root@host1:~# exit
exit
guest-pf@host1$ 

guest-pf@host1$ zoneadm -z myzone boot
guest-pf@host1$ 
I save myself a few steps with a System Configuration File that sets the hostname, IP addresses, and the like, so I am not prompted for that information the first time it boots.
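
If you do not have such a profile yet, one can be generated interactively with sysconfig; a sketch (the file name matches the one used in the install above):
guest-pf@host1$ sysconfig create-profile -o myzone.xml
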
guest-pf@host1$ zlogin myzone
[Connected to zone 'myzone' pts/2]
Last login: Thu Apr 21 19:15:02 2016 on pts/2
Oracle Corporation	SunOS 5.11	11.3	February 2016
root@myzone:~# 
root@myzone:~# dladm show-phys
root@myzone:~# dladm show-link
LINK                CLASS     MTU    STATE    OVER
net2                vnic      1500   up       ?
net1                vnic      1500   up       ?
net0                vnic      1500   up       ?
root@myzone:~# 
root@myzone:~# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
net0/v4           static   ok           192.168.113.16/24
lo0/v6            static   ok           ::1/128
net0/v6           addrconf ok           fe80::214:4fff:fef8:5001/10
root@myzone:~# 
root@myzone:~# ping 192.168.113.1
192.168.113.1 is alive
root@myzone:~# 
root@myzone:~# snoop -d net0
Using device net0 (promiscuous mode)
^Croot@myzone:~# 
root@myzone:~# 

root@myzone:~# snoop -d net1
Using device net1 (promiscuous mode)
192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.2, 192.168.111.2 ?
192.168.111.1 -> (broadcast)  ARP C Who is 192.168.111.2, 192.168.111.2 ?
^Croot@myzone:~# 
root@myzone:~# snoop -d net2
Using device net2 (promiscuous mode)
192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.6, 192.168.112.6 ?
192.168.112.1 -> (broadcast)  ARP C Who is 192.168.112.6, 192.168.112.6 ?
^Croot@myzone:~# 
While the 113 VLAN on net0 is relatively quiet (all ping attempts are being answered, so there is no broadcasting going on), there is traffic visible on VLANs 111 and 112. What you may note here is that the VNICs are bringing data into the Zone without the VLAN tags. At this time only one VLAN ID can be set per VNIC, so there is no need to bring in the tag, and it actually hides some network details and complexity from the Zone.
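
From the Global Zone side you can still see which VLAN each Zone VNIC carries; a sketch:
guest-pf@host1$ dladm show-vnic myzone/net1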

I think the final item I want to show is the link failure as seen in the Zone.

root@myzone:~# dladm show-link
LINK                CLASS     MTU    STATE    OVER
net2                vnic      1500   down     ?
net1                vnic      1500   up       ?
net0                vnic      1500   up       ?
Again, not all VNICs are showing they are down. What does it look like in the Global Zone?
guest-pf@host1$ dladm show-link
LINK                CLASS     MTU    STATE    OVER
net0                phys      1500   up       --
net1                phys      1500   up       --
vnic11              vnic      1500   up       net1
vnic1111            vnic      1500   up       net1
myzone/net2         vnic      1500   down     net1
myzone/net1         vnic      1500   up       net1
myzone/net0         vnic      1500   up       net1
guest-pf@host1$ 
Also here, only one VNIC is showing the link is down. You can also see another benefit of using the anet feature: each of the Zone's VNICs is identified with the Zone name as a prefix.

Wrapping Things Up

So we have gone over the following items:
  • Creating an aggregation in Solaris 11
  • Creating a VLAN on an aggregation
  • Showing what happens when link(s) fail
  • Creating an LDom virtual switch in a Solaris 11 Service Domain
  • Adding a virtual network (vnet) interface to an LDom Guest Domain
  • Configuring and testing VLANs on the vnet
  • Demonstrating link failure propagation with an LDom vnet
  • Creating a Solaris 11 VNIC in a Guest Domain
  • Showing how Zones use VNICs and VLANs
Wow, that was a lot of territory. No wonder it took a while.

I hope it is useful for you!

Regards,

Steffen

Appreciations

Thanks to Nicolas Droux for a quick reply to my question on the VNIC behavior when the link is down, and his ongoing internal answers to my deeper Solaris networking questions.

Thanks to Jeff Savit for a quick review and editorial suggestions. He and I discussed the need for this topic several times.

Revision History

(Other than minor typographical changes)

2016.04.22: Posted

2016.04.21: Created







        
    

Monday Sep 21, 2015

Configuring Secure NFS in Solaris 11

This entry goes through the steps to build a Secure NFS server and client configuration. This includes the necessary DNS server configuration, creating a single Kerberos Key Distribution Center, and configuring the NFS server and client to force access using Secure NFS.

Secure NFS: Step O, as in Optional--NTP and DNS

Optional Network Time Protocol and Domain Name System Setup for Kerberos

Kerberos requires in-sync system time across all systems using the service. Solaris Kerberos also requires direct access to DNS, as it does not use the local name service switch for host name resolution. Thus I start with the steps to set up NTP and DNS, should you need either or both.

NTP

Since my setup is using Solaris Zones on a single system, they share the Global Zone's clock, and thus all the Zones' times are in sync. When using Kerberos across multiple systems, it is suggested to keep clock skew at a minimum. You may be doing this already for other reasons. If not, here is a simple Network Time Protocol configuration. Your routers may be valid NTP servers.

I add several server references in /etc/inet/ntp.conf, which I base off of the provided /etc/inet/ntp.client file.

global# diff /etc/inet/ntp.conf /etc/inet/ntp.client
49,53d48
< server 0.us.pool.ntp.org iburst
< server 1.us.pool.ntp.org iburst
< server 2.us.pool.ntp.org iburst
< server 3.us.pool.ntp.org iburst
global#

Replace the "x.us.pool.ntp.org" with your NTP servers' IP addresses or hostnames.

DNS

DNS infrastructure is required for Kerberos. Solaris' Kerberos is compiled to use DNS to do hostname lookups. See Kerberos, DNS, and the Naming Service.

If you have DNS servers that you can update, or even just reference for the nodes you need, please use them. If you don't have that, or don't want to use them, here are steps to set up your own DNS service. This will include a single DNS server; a more highly available DNS setup is out of the scope of this entry.

Create the DNS server Solaris Zone

My Zone configuration file is as follows.

global# cat dns.cfg
create -b
set brand=solaris
set zonepath=/zones/dns
set autoboot=false
set autoshutdown=shutdown
set ip-type=exclusive
add anet
set linkname=net0
set lower-link=net1
set configure-allowed-address=true
set link-protection=mac-nospoof
set mac-address=random
set vlan-id=17
end
add anet
set linkname=net1
set lower-link=net0
set configure-allowed-address=true
set link-protection=mac-nospoof
set mac-address=random
end
add admin
set user=steffen
set auths=login,manage,config
end
global#

The Zone has two network interfaces. The first (linkname=net0) is on VLAN ID 17 and is for this Secure NFS setup. The second network interface (linkname=net1) ties into my local network, and also my local DNS server (my broadband router at home, or my office network's DNS server, which I cannot get modified for my hostnames).

I also set the Zone up so that I can administer it without becoming root, though all the examples here are as root.

I configure the zone using the dns.cfg configuration file as input.

global# zonecfg -z dns -f dns.cfg
UX: /usr/sbin/usermod: steffen is currently logged in, some changes may not take effect until next login.
global#

Then to speed things up I clone the Zone from a "master" zone I created in advance. On my system a clone takes less than 20 seconds, while an install, with a local IPS repository, takes about 90 seconds. Your times will vary based on your system, type of storage, and the network connection to the IPS repository you use.

global# zoneadm -z dns clone -c dns_profile.xml kdcmaster
The following ZFS file system(s) have been created:
    pool1/zones/dns
Progress being logged to /var/log/zones/zoneadm.20150901T012022Z.dns.clone
Log saved in non-global zone as /zones/dns/root/var/log/zones/zoneadm.20150901T012022Z.dns.clone
global#

Let's boot the Zone.

global# zoneadm -z dns boot
global#

Once the Zone is up and running, I like to create a new boot environment, so that if I have to revert the changes I make, I can simply boot back into the original boot environment instead of re-creating the Zone. While creating a new Zone is fast, this saves some work, and it is also convenient later on for testing additional changes.

global# zlogin dns
[Connected to zone 'dns' pts/8]
Oracle Corporation	SunOS 5.11	11.2	July 2015
root@dns:~#

root@dns:~# beadm create dns
root@dns:~# beadm activate dns
root@dns:~# reboot

[Connection to zone 'dns' pts/8 closed]
global#

Install the DNS server in the Solaris Zone

The DNS server package service/network/dns/bind is not installed by default, so we have to install it. We can verify it is not there by testing for the service.

global# zlogin dns
[Connected to zone 'dns' pts/8]
Oracle Corporation	SunOS 5.11	11.2	July 2015
root@dns:~#

root@dns:~# svcs *dns*
STATE          STIME    FMRI
disabled       21:26:25 svc:/network/dns/multicast:default
online         21:26:29 svc:/network/dns/client:default
root@dns:~#

root@dns:~# pkg install pkg:/service/network/dns/bind
           Packages to install:  1
            Services to change:  1
       Create boot environment: No
Create backup boot environment: No
DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                                1/1         38/38      1.4/1.4  9.2M/s

PHASE                                          ITEMS
Installing new actions                         74/74
Updating package state database                 Done
Updating package cache                           0/0
Updating image state                            Done
Creating fast lookup database                   Done
Updating package cache                           2/2
root@dns:~#

root@dns:~# svcs *dns*
STATE          STIME    FMRI
disabled       21:26:25 svc:/network/dns/multicast:default
disabled       21:27:17 svc:/network/dns/server:default
online         21:26:29 svc:/network/dns/client:default
root@dns:~#

Configure the DNS server

With the DNS server package installed, it is time to create a basic DNS server configuration. I am using network 172.17.0.0/22 for some historical reasons. You can adjust to meet your own preferences or local requirements.

Some preliminary work for my configuration: my Zone configuration, if you remember, has two networks. The sysconfig profile configured net0 for my private network. I still need to configure net1 on my standard network. I will use DHCP to get an address.

root@dns:~# dladm show-link
LINK                CLASS     MTU    STATE    OVER
net0                vnic      1500   up       ?
net1                vnic      1500   up       ?
root@dns:~#
root@dns:~# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
net0/v4           static   ok           172.17.0.250/22
lo0/v6            static   ok           ::1/128
net0/v6           addrconf ok           fe80::8:20ff:fe90:a16e/10
root@dns:~#
root@dns:~# ipadm create-ip net1
root@dns:~#
root@dns:~# ipadm create-addr -T dhcp net1
net1/v4
root@dns:~#
root@dns:~# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
net0/v4           static   ok           172.17.0.250/22
net1/v4           dhcp     ok           192.168.1.112/24
lo0/v6            static   ok           ::1/128
net0/v6           addrconf ok           fe80::8:20ff:fe90:a16e/10
root@dns:~#

It is time to create the master DNS file in /etc/named.conf. Some items of note include:

  • My two subnets, 172.17.0.0/22 and 192.168.1.0/24
  • I have ACLs to allow access from my two subnets
  • I set a forwarder to my local DNS server (my local router or my office network's DNS servers).
  • I listen on the two networks listed in the ipadm output above.
  • This is set up for additional slave DNS servers, though I will not be showing the setup of that here.

Here is my final /etc/named.conf file.

root@dns:~# cat /etc/named.conf
//
// sample BIND configuration file
// taken from http://www.madboa.com/geek/soho-bind/
//

// Added acl per DNS setup at
// https://www.digitalocean.com/community/tutorials/how-to-configure-bind-as-a-caching-or-forwarding-dns-server-on-ubuntu-14-04
//
acl goodclients {
  172.17.0.0/22;
  192.168.1.0/24;
  localhost;
};

options {
  // tell named where to find files mentioned below
  directory "/var/named";
  // on a multi-homed host, you might want to tell named
  // to listen for queries only on certain interfaces
  listen-on { 127.0.0.1; 172.17.0.250/22; 192.168.1.112/24; };
  allow-query { goodclients; };
  forwarders { 192.168.1.1; };
};

// The single dot (.) is the root of all DNS namespace, so
// this zone tells named where to start looking for any
// name on the Internet
zone "." IN {
  // a hint type means that we've got to look elsewhere
  // for authoritative information
  type hint;
  file "named.root";
};

// Where the localhost hostname is defined
zone "localhost" IN {
  // a master type means that this server needn't look
  // anywhere else for information; the localhost buck
  // stops here.
  type master;
  file "zone.localhost";
  // don't allow dynamic DNS clients to update info
  // about the localhost zone
  allow-update { none; };
};

// Where the 127.0.0.0 network is defined
zone "0.0.127.in-addr.arpa" IN {
  type master;
  file "revp.127.0.0";
  allow-update { none; };
};

zone "steffentw.com" IN {
  // this is the authoritative server for
  // steffentw.com info
  type master;
  file "zone.com.steffentw";
  also-notify { 172.17.0.251; 172.17.0.252; };
};

zone "0.17.172.in-addr.arpa" {
  // this is the authoritative server for
  // the 172.17.0.0/22 network
  type master;
  file "revp.172.17.0.0";
  also-notify { 172.17.0.251; 172.17.0.252; };
};
root@dns:~#

Now I have to create or update the files pointed to by /etc/named.conf with my local hostnames.

root@dns:~# cd /var/named
root@dns:/var/named# ls
named.root        revp.172.17.0.0   zone.localhost
revp.127.0.0      zone.com.steffentw
root@dns:/var/named#
root@dns:/var/named# cat zone.com.steffentw
;
; dns zone for for steffentw.com
;
; 20150827	Hide _nfsv4idmapdomain to test domainname(1M) response
; 20150824	Removed CNAME for kdc to see if this is required
;
$ORIGIN steffentw.com.
$TTL 1M				; set to 1M for testing, was 1D
; any time you make a change to the domain, bump the
; "serial" setting below. the format is easy:
; YYYYMMDDI, with the I being an iterator in case you
; make more than one change during any one day
@	IN SOA   dns hostmaster (
			201508311 ; serial
			8H        ; refresh
			4M        ; retry
			1H        ; expire
			1D        ; minimum
			)
; dns.steffentw.com serves this domain as both the
; name server (NS) and mail exchange (MX)
		NS	dns
		MX	10 dns
; define domain functions with CNAMEs
depot           CNAME   dns
www             CNAME   dns
; for NFSv4 (2015.08.12)
;_nfsv4idmapdomain	IN TXT	"steffentw.com"
; just in case someone asks for localhost.steffentw.com
localhost	A	127.0.0.1
;
;	172.17.0.0/22 Infrastructure Administration Network
;
host1		A	172.17.0.101
host2		A	172.17.0.102
host3		A	172.17.0.103
host4		A	172.17.0.104
host5		A	172.17.0.105
host6		A	172.17.0.106
host7		A	172.17.0.107
host8		A	172.17.0.108
host9		A	172.17.0.109
zfs1		A	172.17.0.201
zfs2		A	172.17.0.202
zfs3		A	172.17.0.203
dns		A	172.17.0.250
kdc1		A	172.17.0.251
kdc2		A	172.17.0.252
kdc3		A	172.17.0.253
root@dns:/var/named#
root@dns:/var/named# cat revp.172.17.0.0
;
; reverse pointers for 172.17.0.0 subnet
;
$ORIGIN 0.17.172.in-addr.arpa.
$TTL 1D
@	IN SOA  dns.steffentw.com. hostmaster.steffentw.com. (
		201508311  ; serial
		28800      ; refresh (8 hours)
		14400      ; retry (4 hours)
		2419200    ; expire (4 weeks)
		86400      ; minimum (1 day)
		)
; define the authoritative name server
		NS	dns.steffentw.com.
;		NS	dns1.steffentw.com.
;		NS	dns2.steffentw.com.
;
;       172.17.0.0/22 Infrastructure Administration Network
;
101	PTR	host1.steffentw.com.
102	PTR	host2.steffentw.com.
103	PTR	host3.steffentw.com.
104	PTR	host4.steffentw.com.
105	PTR	host5.steffentw.com.
106	PTR	host6.steffentw.com.
107	PTR	host7.steffentw.com.
108	PTR	host8.steffentw.com.
109	PTR	host9.steffentw.com.
;
201	PTR	zfs1.steffentw.com.
202	PTR	zfs2.steffentw.com.
203	PTR	zfs3.steffentw.com.
;
250	PTR	dns.steffentw.com.
251	PTR	kdc1.steffentw.com.
252	PTR	kdc2.steffentw.com.
253	PTR	kdc3.steffentw.com.
root@dns:/var/named#

With those files created it is time to enable the DNS server. Keep an eye out on the console of the Zone in case you have errors.

root@dns:/var/named# svcs *dns*
STATE          STIME    FMRI
disabled       21:26:25 svc:/network/dns/multicast:default
disabled       21:27:17 svc:/network/dns/server:default
online         21:26:29 svc:/network/dns/client:default
root@dns:/var/named#
root@dns:/var/named# svcadm enable dns/server
root@dns:/var/named#
root@dns:/var/named# svcs *dns*
STATE          STIME    FMRI
disabled       21:26:25 svc:/network/dns/multicast:default
online         21:26:29 svc:/network/dns/client:default
online         21:44:31 svc:/network/dns/server:default
root@dns:/var/named#

Test the DNS server

Let us see if DNS really works.

root@dns:~# getent hosts kdc1
172.17.0.251	kdc1.steffentw.com
root@dns:~# getent hosts host1
172.17.0.101	host1.steffentw.com
root@dns:~#

A quick test to see if this Zone can do a DNS lookup for an external name.

root@dns:~# nslookup www.oracle.com
Server:		172.17.0.250
Address:	172.17.0.250#53

Non-authoritative answer:
www.oracle.com	canonical name = www.oracle.com.edgekey.net.
www.oracle.com.edgekey.net	canonical name = e7075.x.akamaiedge.net.
Name:	e7075.x.akamaiedge.net
Address: 23.66.214.140

root@dns:~#
root@dns:~# getent hosts www.oracle.com
23.66.214.140	e7075.x.akamaiedge.net www.oracle.com www.oracle.com.edgekey.net
root@dns:~#

Summary and Next Step

With NTP and DNS working, the next step is to build the Key Distribution Center. Either go to KDC setup or back to the introduction.

Secure NFS: Step 1--Setting Up the Kerberos KDC

Kerberos KDC

With DNS set up, the next service to configure is the Key Distribution Center. It will need to access DNS services.

Creating the KDC Zone

The Zone configuration is similar to the DNS server, with the interface using VLAN ID 17 in my setup.

global# cat kdc1.cfg
create -b
set brand=solaris
set zonepath=/zones/kdc1
set autoboot=false
set autoshutdown=shutdown
set ip-type=exclusive
add anet
set linkname=net0
set lower-link=net1
set configure-allowed-address=true
set link-protection=mac-nospoof
set mac-address=random
set vlan-id=17
end
add admin
set user=steffen
set auths=login,manage,config
end
global#

Since the KDC must use DNS, let's put that into the sysconfig profile.

global# more kdc1_profile.xml
...
  <service version="1" type="service" name="network/install">
    <instance enabled="true" name="default">
      <property_group type="application" name="install_ipv6_interface">
        <propval type="astring" name="stateful" value="yes"/>
        <propval type="astring" name="address_type" value="addrconf"/>
        <propval type="astring" name="name" value="net0/v6"/>
        <propval type="astring" name="stateless" value="yes"/>
      </property_group>
      <property_group type="application" name="install_ipv4_interface">
        <propval type="net_address_v4" name="static_address" value="172.17.0.251 /24"/>
        <propval type="astring" name="name" value="net0/v4"/>
        <propval type="astring" name="address_type" value="static"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="network/physical">
    <instance enabled="true" name="default">
      <property_group type="application" name="netcfg">
        <propval type="astring" name="active_ncp" value="DefaultFixed"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="system/name-service/switch">
    <property_group type="application" name="config">
      <propval type="astring" name="default" value="files"/>
      <propval type="astring" name="host" value="files dns"/>
    </property_group>
    <instance enabled="true" name="default"/>
  </service>
  <service version="1" type="service" name="network/dns/client">
    <property_group type="application" name="config">
      <property type="net_address" name="nameserver">
        <net_address_list>
          <value_node value="172.17.0.250"/>
        </net_address_list>
      </property>
      <property type="astring" name="search">
        <astring_list>
          <value_node value="steffentw.com"/>
        </astring_list>
      </property>
    </property_group>
    <instance enabled="true" name="default"/>
  </service>
  ...
global#

Configure and clone the KDC Zone.

global# zonecfg -z kdc1 -f kdc1.cfg
UX: /usr/sbin/usermod: steffen is currently logged in, some changes may not take effect until next login.
global#
global#
global# zoneadm -z kdc1 clone -c kdc1_profile.xml kdcmaster
The following ZFS file system(s) have been created:
    pool1/zones/kdc1
Progress being logged to /var/log/zones/zoneadm.20150901T204046Z.kdc1.clone
Log saved in non-global zone as /zones/kdc1/root/var/log/zones/zoneadm.20150901T204046Z.kdc1.clone
global#
global# zoneadm -z kdc1 boot
global#

After logging into the KDC Zone, first verify that DNS is configured properly.

global#
global# zlogin kdc1
[Connected to zone 'kdc1' pts/8]
Oracle Corporation	SunOS 5.11	11.2	July 2015
root@kdc1:~#
root@kdc1:~# getent hosts host1
172.17.0.101	host1.steffentw.com
root@kdc1:~#

Installing the Kerberos Server Software

The necessary KDC package is not installed by default.

root@kdc1:~# svcs *krb5* ; svcs *kerb*
STATE          STIME    FMRI
STATE          STIME    FMRI
disabled       16:41:20 svc:/system/kerberos/install:default
root@kdc1:~#

Again I prefer to create an alternate boot environment. This time I will do it as part of the package installation.

root@kdc1:~# pkg install --be-name kdc system/security/kerberos-5
           Packages to install:   1
       Create boot environment: Yes
Create backup boot environment:  No
DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                                1/1         41/41      0.7/0.7 27.9M/s

PHASE                                          ITEMS
Installing new actions                         90/90
Updating package state database                 Done
Updating package cache                           0/0
Updating image state                            Done
Creating fast lookup database                   Done
Updating package cache                           2/2

A clone of solaris-0 exists and has been updated and activated.
On the next boot the Boot Environment kdc will be
mounted on '/'.  Reboot when ready to switch to this updated BE.

Updating package cache                           2/2
root@kdc1:~#

A quick check on the BE, and then boot into it.

root@kdc1:~# beadm list
BE        Flags Mountpoint Space  Policy Created         
--        ----- ---------- -----  ------ -------         
kdc       R     -          95.45M static 2015-09-01 16:47
solaris-0 N     /          6.29M  static 2015-09-01 16:40
root@kdc1:~#
root@kdc1:~# reboot

[Connection to zone 'kdc1' pts/8 closed]
global#

First let's confirm the necessary services are there.

global# zlogin kdc1
[Connected to zone 'kdc1' pts/8]
Oracle Corporation	SunOS 5.11	11.2	July 2015
root@kdc1:~#
root@kdc1:~# svcs *krb5* ; svcs *kerb*
STATE          STIME    FMRI
disabled       16:48:22 svc:/network/security/krb5_prop:default
disabled       16:48:22 svc:/network/security/krb5kdc:default
STATE          STIME    FMRI
disabled       16:48:21 svc:/system/kerberos/install:default
root@kdc1:~#

Configuring the KDC

The first configuration step is to modify two files. I make copies, both as backups and to compare the new files to the originals here.

root@kdc1:~# cd /etc/krb5/
root@kdc1:/etc/krb5#
root@kdc1:/etc/krb5# cp -p kdc.conf kdc.conf.orig
root@kdc1:/etc/krb5# cp -p krb5.conf krb5.conf.orig
root@kdc1:/etc/krb5#
root@kdc1:/etc/krb5# vi kdc.conf
root@kdc1:/etc/krb5#
root@kdc1:/etc/krb5# cat kdc.conf
#
#
# Copyright (c) 2008, Oracle and/or its affiliates. All rights reserved.
#

[kdcdefaults]
	kdc_ports = 88,750

[realms]
	___default_realm___ = {
		profile = /etc/krb5/krb5.conf
		database_name = /var/krb5/principal
		acl_file = /etc/krb5/kadm5.acl
		kadmind_port = 749
		max_life = 8h 0m 0s
		max_renewable_life = 7d 0h 0m 0s
		default_principal_flags = +preauth
 		master_key_type = des3-cbc-sha1-kd
 		supported_enctypes = des3-cbc-sha1-kd:normal
	}
root@kdc1:/etc/krb5#
root@kdc1:/etc/krb5# diff kdc.conf*
18,19d17
<  		master_key_type = des3-cbc-sha1-kd
<  		supported_enctypes = des3-cbc-sha1-kd:normal
root@kdc1:/etc/krb5#
root@kdc1:/etc/krb5# vi krb5.conf
root@kdc1:/etc/krb5#
root@kdc1:/etc/krb5# head -20 krb5.conf
#
#
# Copyright (c) 2007, Oracle and/or its affiliates. All rights reserved.
#

# krb5.conf template
# In order to complete this configuration file
# you will need to replace the ____ placeholders
# with appropriate values for your network and uncomment the
# appropriate entries.
#
[libdefaults]
#        default_realm = ___default_realm___
 	default_tgs_enctypes = des3-cbc-sha1-kd
 	default_tkt_enctypes = des3-cbc-sha1-kd
 	permitted_enctypes = des3-cbc-sha1-kd
 	allow_weak_enctypes = false


[realms]
root@kdc1:/etc/krb5#
root@kdc1:/etc/krb5# diff krb5.conf*
14,17d13
<  	default_tgs_enctypes = des3-cbc-sha1-kd
<  	default_tkt_enctypes = des3-cbc-sha1-kd
<  	permitted_enctypes = des3-cbc-sha1-kd
<  	allow_weak_enctypes = false
19d14
<
root@kdc1:/etc/krb5#

Since my sample domain name is steffentw.com, my Kerberos realm is STEFFENTW.COM. Here I create the master KDC. It will prompt for two sets of passwords; make sure you remember them. The admin password will be required on all the clients.

root@kdc1:/etc/krb5# kdcmgr -a kws/admin -r STEFFENTW.COM create master

Starting server setup
---------------------------------------------------

Setting up /etc/krb5/kdc.conf.

Setting up /etc/krb5/krb5.conf.

Initializing database '/var/krb5/principal' for realm 'STEFFENTW.COM',
master key name 'K/M@STEFFENTW.COM'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key: enter master password here
Re-enter KDC database master key to verify: enter master password here

Authenticating as principal root/admin@STEFFENTW.COM with password.
WARNING: no policy specified for kws/admin@STEFFENTW.COM; defaulting to no policy
Enter password for principal "kws/admin@STEFFENTW.COM": enter admin password here
Re-enter password for principal "kws/admin@STEFFENTW.COM": enter admin password here
Principal "kws/admin@STEFFENTW.COM" created.

Setting up /etc/krb5/kadm5.acl.

---------------------------------------------------
Setup COMPLETE.

root@kdc1:/etc/krb5#

Once the configuration is complete, I quickly check to make sure it looks OK. I especially look for kadmin:default to be online.

root@kdc1:/etc/krb5# kdcmgr status

KDC Status Information
--------------------------------------------
svc:/network/security/krb5kdc:default (Kerberos key distribution center)
 State: online since September  1, 2015 04:51:06 PM EDT
   See: man -M /usr/share/man -s 1M krb5kdc
   See: /var/svc/log/network-security-krb5kdc:default.log
Impact: None.

KDC Master Status Information
--------------------------------------------
svc:/network/security/kadmin:default (Kerberos administration daemon)
 State: online since September  1, 2015 04:51:07 PM EDT
   See: man -M /usr/share/man -s 1M kadmind
   See: /var/svc/log/network-security-kadmin:default.log
Impact: None.

Transaction Log Information
--------------------------------------------

Kerberos update log (/var/krb5/principal.ulog)
Update log dump :
	Log version # : 1
	Log state : Stable
	Entry block size : 2048
	Number of entries : 3
	First serial # : 1
	Last serial # : 3
	First time stamp : Tue Sep  1 16:51:06 2015
	Last time stamp : Tue Sep  1 16:51:06 2015


Kerberos Related File Information
--------------------------------------------
(will display any missing files below)

root@kdc1:/etc/krb5#

Enabling Kerberos Client Configuration

With the KDC set up, the next step is to make it easier to configure the Kerberos clients. Two files are required, and by putting them into a location that is shared via NFS, setting up the clients will be very easy.

Step 1 is to create a mountpoint.

root@kdc1:/etc/krb5# zfs create -o mountpoint=/share -o share.nfs=on rpool/share
root@kdc1:/etc/krb5#
root@kdc1:/etc/krb5# share
rpool_share	/share	nfs	sec=sys,rw	
root@kdc1:/etc/krb5#

Step 2 is to create the file kcprofile and to copy the krb5.conf file into the share.

root@kdc1:/etc/krb5# mkdir /share/krb5
root@kdc1:/etc/krb5#
root@kdc1:/etc/krb5# vi /share/krb5/kcprofile
root@kdc1:/etc/krb5#
root@kdc1:/etc/krb5# cat /share/krb5/kcprofile
REALM STEFFENTW.COM
KDC kdc1.steffentw.com
ADMIN kws
FILEPATH /net/kdc1.steffentw.com/share/krb5/krb5.conf
NFS 1
DNSLOOKUP none
root@kdc1:/etc/krb5#
root@kdc1:/etc/krb5# cp /etc/krb5/krb5.conf /share/krb5/
root@kdc1:/etc/krb5#
root@kdc1:/etc/krb5# cat /share/krb5/krb5.conf 
[libdefaults]
	default_realm = STEFFENTW.COM

[realms]
	STEFFENTW.COM = {
		kdc = kdc1.steffentw.com
		admin_server = kdc1.steffentw.com
	}

[domain_realm]
	.steffentw.com = STEFFENTW.COM

[logging]
	default = FILE:/var/krb5/kdc.log
	kdc = FILE:/var/krb5/kdc.log
	kdc_rotate = {
		period = 1d
		versions = 10
	}

[appdefaults]
	kinit = {
		renewable = true
		forwardable = true
	}
root@kdc1:/etc/krb5#

Summary and Next Step

With the KDC set up, the next step is to create the first client and configure secure NFS. Either go to NFS Server Setup or back to the introduction.

Secure NFS: Step 2--First Kerberos Client--NFS Server

Secure NFS Server

With our Kerberos KDC set up, it is time to build the NFS server. First step is creating another Solaris Zone similar to the previous ones.

Creating an NFS Server Zone

global# cat zfs1.cfg
create -b
set brand=solaris
set zonepath=/zones/zfs1
set autoboot=false
set autoshutdown=shutdown
set ip-type=exclusive
add anet
set linkname=net0
set lower-link=net2
set configure-allowed-address=true
set link-protection=mac-nospoof
set mac-address=random
set vlan-id=17
end
add admin
set user=steffen
set auths=login,manage,config
end
global#
global# zonecfg -z zfs1 -f zfs1.cfg
UX: /usr/sbin/usermod: steffen is currently logged in, some changes may not take effect until next login.
global#
global# zoneadm -z zfs1 clone -c zfs1_profile.xml kdcmaster
The following ZFS file system(s) have been created:
    pool1/zones/zfs1
Progress being logged to /var/log/zones/zoneadm.20150901T210134Z.zfs1.clone
Log saved in non-global zone as /zones/zfs1/root/var/log/zones/zoneadm.20150901T210134Z.zfs1.clone
global#
global# zoneadm -z zfs1 boot
global#

Configuring the Zone as a Kerberos Client

We also follow the same steps as for the previous KDC client.

global# zlogin zfs1
[Connected to zone 'zfs1' pts/10]
Oracle Corporation	SunOS 5.11	11.2	July 2015
root@zfs1:~#
root@zfs1:~# ping kdc1
kdc1 is alive
root@zfs1:~#
root@zfs1:~# cat /net/kdc1/share/krb5/kcprofile
REALM STEFFENTW.COM
KDC kdc1.steffentw.com
ADMIN kws
FILEPATH /net/kdc1.steffentw.com/share/krb5/krb5.conf
NFS 1
DNSLOOKUP none
root@zfs1:~#
root@zfs1:~# head -5 /net/kdc1.steffentw.com/share/krb5/krb5.conf
[libdefaults]
	default_realm = STEFFENTW.COM

[realms]
	STEFFENTW.COM = {
root@zfs1:~#
root@zfs1:~# kclient -p /net/kdc1/share/krb5/kcprofile

Starting client setup

---------------------------------------------------

Setting up /etc/krb5/krb5.conf.

Copied /net/kdc1.steffentw.com/share/krb5/krb5.conf to /system/volatile/kclient/kclient-krb5conf.MYaafI.
Obtaining TGT for kws/admin ...
Password for kws/admin@STEFFENTW.COM: enter admin password here
kinit:  no ktkt_warnd warning possible

nfs/zfs1.steffentw.com entry ADDED to KDC database.
nfs/zfs1.steffentw.com entry ADDED to keytab.

host/zfs1.steffentw.com entry ADDED to KDC database.
host/zfs1.steffentw.com entry ADDED to keytab.

---------------------------------------------------
Setup COMPLETE.

root@zfs1:~#
root@zfs1:~# klist -k
Keytab name: FILE:/etc/krb5/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   2 nfs/zfs1.steffentw.com@STEFFENTW.COM
   2 nfs/zfs1.steffentw.com@STEFFENTW.COM
   2 nfs/zfs1.steffentw.com@STEFFENTW.COM
   2 nfs/zfs1.steffentw.com@STEFFENTW.COM
   2 host/zfs1.steffentw.com@STEFFENTW.COM
   2 host/zfs1.steffentw.com@STEFFENTW.COM
   2 host/zfs1.steffentw.com@STEFFENTW.COM
   2 host/zfs1.steffentw.com@STEFFENTW.COM
root@zfs1:~#

Configuring the NFS Server File System

With the NFS server a Kerberos client, now create a ZFS file system that is exported as an NFS share requiring Kerberos privacy settings (the "krb5p" setting.)

root@zfs1:~# zfs create -o mountpoint=/secure -o share.nfs=on -o share.nfs.sec=krb5p rpool/secure
root@zfs1:~# share
rpool_secure	/secure	nfs	sec=krb5p,rw	
root@zfs1:~#

Then create a file with some easily recognized content.

root@zfs1:~# echo "The quick brown fox jumps over the lazy dog." > /secure/fox.txt
root@zfs1:~#
root@zfs1:~# cat /secure/fox.txt
The quick brown fox jumps over the lazy dog.
root@zfs1:~#

Summary and Next Step

With the NFS server running, the next step is to create an NFS client. Either go to NFS Client Setup or back to the introduction.

Secure NFS: Step 3--The Secure NFS Client

Secure NFS Client

We are getting close to a fully completed configuration. The next item is the client.

Build the NFS Client Zone as a KDC Client

global# cat host1.cfg
create -b
set brand=solaris
set zonepath=/zones/host1
set autoboot=false
set autoshutdown=shutdown
set ip-type=exclusive
add anet
set linkname=net0
set lower-link=net2
set configure-allowed-address=true
set link-protection=mac-nospoof
set mac-address=random
set vlan-id=17
end
add admin
set user=steffen
set auths=login,manage,config
end
global#
global# zoneadm -z host1 clone -c host1_profile.xml kdcmaster
The following ZFS file system(s) have been created:
    pool1/zones/host1
Progress being logged to /var/log/zones/zoneadm.20150901T213207Z.host1.clone
Log saved in non-global zone as /zones/host1/root/var/log/zones/zoneadm.20150901T213207Z.host1.clone
global#
global# zlogin host1
[Connected to zone 'host1' pts/8]
Oracle Corporation	SunOS 5.11	11.2	July 2015
root@host1:~#
root@host1:~# ping kdc1
kdc1 is alive
root@host1:~#
root@host1:~# cat /net/kdc1/share/krb5/kcprofile
REALM STEFFENTW.COM
KDC kdc1.steffentw.com
ADMIN kws
FILEPATH /net/kdc1.steffentw.com/share/krb5/krb5.conf
NFS 1
DNSLOOKUP none
root@host1:~#
root@host1:~# kclient -p /net/kdc1/share/krb5/kcprofile

Starting client setup

---------------------------------------------------

Setting up /etc/krb5/krb5.conf.

Copied /net/kdc1.steffentw.com/share/krb5/krb5.conf to /system/volatile/kclient/kclient-krb5conf.ToaOPV.
Obtaining TGT for kws/admin ...
Password for kws/admin@STEFFENTW.COM: enter admin password here
kinit:  no ktkt_warnd warning possible

nfs/host1.steffentw.com entry ADDED to KDC database.
nfs/host1.steffentw.com entry ADDED to keytab.

host/host1.steffentw.com entry ADDED to KDC database.
host/host1.steffentw.com entry ADDED to keytab.

---------------------------------------------------
Setup COMPLETE.

root@host1:~#

Demonstrate the NFS Client Working

The simplest test is to just navigate to the /net/<server name> location.

root@host1:~# cat /net/zfs1/secure/fox.txt
The quick brown fox jumps over the lazy dog.
root@host1:~#

However, was this really an encrypted data transfer? One way to check is with snoop(1M).

root@host1:~# snoop -d net0 -r host zfs1 &
[1] 21547
root@host1:~# Using device net0 (promiscuous mode)

root@host1:~# cat /net/zfs1/secure/fox.txt
The quick brown fox jumps over the lazy dog.
root@host1:~# 172.17.0.101 -> 172.17.0.201 TCP D=2049 S=1023 Syn Seq=1000276621 Len=0 Win=32804 Options=<mss 1460,sackOK,tstamp 129311831 0,nop,wscale 5>
172.17.0.201 -> 172.17.0.101 TCP D=1023 S=2049 Syn Ack=1000276622 Seq=576217546 Len=0 Win=32806 Options=<sackOK,tstamp 129311831 129311831,mss 1460,nop,wscale 5>
172.17.0.101 -> 172.17.0.201 TCP D=2049 S=1023 Ack=576217547 Seq=1000276622 Len=0 Win=32806 Options=<nop,nop,tstamp 129311831 129311831>
...
172.17.0.101 -> 172.17.0.201 RPC RPCSEC_GSS C NFS ver(4) proc(1) (data encrypted)
172.17.0.201 -> 172.17.0.101 TCP D=1023 S=2049 Ack=1000276950 Seq=576217547 Len=0 Win=32796 Options=<nop,nop,tstamp 129311831 129311831>
172.17.0.201 -> 172.17.0.101 RPC RPCSEC_GSS R NFS ver(4) proc(1) (data encrypted)
172.17.0.101 -> 172.17.0.201 TCP D=2049 S=1023 Ack=576217959 Seq=1000276950 Len=0 Win=32806 Options=<nop,nop,tstamp 129311832 129311832>
...
172.17.0.101 -> 172.17.0.201 RPC RPCSEC_GSS C NFS ver(4) proc(1) (data encrypted)
172.17.0.201 -> 172.17.0.101 RPC RPCSEC_GSS R NFS ver(4) proc(1) (data encrypted)
...
root@host1:~# kill %1
root@host1:~#

To see the difference, let's create a second share that does not require Kerberos.

root@zfs1:~# zfs create -o mountpoint=/clear -o share.nfs=on rpool/clear
root@zfs1:~#
root@zfs1:~# share
rpool_secure	/secure	nfs	sec=krb5p,rw	
rpool_clear	/clear	nfs	sec=sys,rw	
root@zfs1:~#
root@zfs1:~# cp /secure/fox.txt /clear/
root@zfs1:~#

And run snoop with the option to dump all the data in each Ethernet frame. I like to use -x 0.

First, using the encrypted mount point.

root@host1:~# snoop -d net0 -r -x 0 host zfs1 &
[1] 21591
root@host1:~# Using device net0 (promiscuous mode)

root@host1:~# cat /net/zfs1/secure/fox.txt
The quick brown fox jumps over the lazy dog.
root@host1:~# 172.17.0.101 -> 172.17.0.201 TCP D=2049 S=48428 Syn Seq=788443968 Len=0 Win=64240 Options=<mss 1460,sackOK,tstamp 129469208 0,nop,wscale 1>

	   0: 0208 20e4 7813 0208 20ea 4c3d 0800 4500    .. .x... .L=..E.
	  16: 003c ea59 4000 4006 0000 ac11 0065 ac11    .<.Y@.@......e..
	  32: 00c9 bd2c 0801 2efe b340 0000 0000 a002    ...,.....@......
	  48: faf0 597f 0000 0204 05b4 0402 080a 07b7    ..Y.............
	  64: 8b18 0000 0000 0103 0301                   ..........

172.17.0.201 -> 172.17.0.101 TCP D=48428 S=2049 Syn Ack=788443969 Seq=2268877688 Len=0 Win=32806 Options=<sackOK,tstamp 129469208 129469208,mss 1460,nop,wscale 5>

	   0: 0208 20ea 4c3d 0208 20e4 7813 0800 4500    .. .L=.. .x...E.
	  16: 003c f568 4000 4006 ec02 ac11 00c9 ac11    .<.h@.@.........
	  32: 0065 0801 bd2c 873c 5378 2efe b341 a012    .e...,.<Sx...A..
	  48: 8026 c6b9 0000 0402 080a 07b7 8b18 07b7    .&..............
	  64: 8b18 0204 05b4 0103 0305                   ..........

172.17.0.101 -> 172.17.0.201 TCP D=2049 S=48428 Ack=2268877689 Seq=788443969 Len=0 Win=64436 Options=<nop,nop,tstamp 129469208 129469208>

	   0: 0208 20e4 7813 0208 20ea 4c3d 0800 4500    .. .x... .L=..E.
	  16: 0034 ea5a 4000 4006 0000 ac11 0065 ac11    .4.Z@.@......e..
	  32: 00c9 bd2c 0801 2efe b341 873c 5379 8010    ...,.....A.<Sy..
	  48: fbb4 5977 0000 0101 080a 07b7 8b18 07b7    ..Yw............
	  64: 8b18                                       ..

...

172.17.0.101 -> 172.17.0.201 RPC RPCSEC_GSS C NFS ver(4) proc(1) (data encrypted)

	   0: 0208 20e4 7813 0208 20ea 4c3d 0800 4500    .. .x... .L=..E.
	  16: 017c ea70 4000 4006 0000 ac11 0065 ac11    .|.p@.@......e..
	  32: 00c9 03ff 0801 4667 92c6 2d1f 25fc 8018    ......Fg..-.%...
	  48: 8026 5abf 0000 0101 080a 07b7 8b1b 07b7    .&Z.............
	  64: 8b1b 8000 0144 6e7d 0f68 0000 0000 0000    .....Dn}.h......
	  80: 0002 0001 86a3 0000 0004 0000 0001 0000    ................
	  96: 0006 0000 0018 0000 0001 0000 0000 0000    ................
	 112: 0002 0000 0003 0000 0004 1e00 0000 0000    ................
	 128: 0006 0000 001c 0404 04ff ffff ffff 0000    ................
	 144: 0000 15d8 2a96 8cb9 33d6 91df d5de 4ee1    ....*...3.....N.
	 160: d51a 0000 00e4 0504 06ff 0000 0000 0000    ................
	 176: 0000 15d8 2a97 61c4 fa98 3b63 14d0 c5cb    ....*.a...;c....
	 192: 59ee 8848 1638 12bc 486e d73a 8b1e d704    Y..H.8..Hn.:....
	 208: 74e2 65e6 e036 6847 32e8 d2c8 a100 655b    t.e..6hG2.....e[
	 224: df06 73df 78d2 af8a 7850 193c a0bc 2147    ..s.x...xP.<..!G
	 240: 6073 7dcf 3038 cfbb 95d4 5f35 489c 65eb    `s}.08...._5H.e.
	 256: 1e54 3572 60c8 9b1e 78c8 f47a ac25 e8be    .T5r`...x..z.%..
	 272: ddd5 c104 8067 cf6a ca03 1327 c14d e5dd    .....g.j...'.M..
	 288: 0f06 2dac bac9 d689 7536 e391 0e3f 14dd    ..-.....u6...?..
	 304: 2f7b 33d1 231e 3b7b 0de5 5ee2 c28f cb54    /{3.#.;{..^....T
	 320: a2e0 2456 1ffa ddf0 c37f 42bf 252b 1667    ..$V......B.%+.g
	 336: 02c2 1fe3 b19d 0d7b 94a2 4e50 748b 5935    .......{..NPt.Y5
	 352: 890b 746c deb2 5744 97a4 4c07 83e4 5377    ..tl..WD..L...Sw
	 368: 4ca4 75e4 8081 f196 6f01 63fd 4e56 bee9    L.u.....o.c.NV..
	 384: 5510 c21a 6b6a 2d63 c326                   U...kj-c.&

172.17.0.201 -> 172.17.0.101 RPC RPCSEC_GSS R NFS ver(4) proc(1) (data encrypted)

	   0: 0208 20ea 4c3d 0208 20e4 7813 0800 4500    .. .L=.. .x...E.
	  16: 01d0 f57e 4000 4006 ea58 ac11 00c9 ac11    ...~@.@..X......
	  32: 0065 0801 03ff 2d1f 25fc 4667 940e 8018    .e....-.%.Fg....
	  48: 8026 8344 0000 0101 080a 07b7 8b1b 07b7    .&.D............
	  64: 8b1b 8000 0198 6e7d 0f68 0000 0001 0000    ......n}.h......
	  80: 0000 0000 0006 0000 001c 0404 05ff ffff    ................
	  96: ffff 0000 0000 22a9 1433 c781 6e9e 8ed8    ......"..3..n...
	 112: e6cc aa86 e4d9 0000 0000 0000 0160 0504    .............`..
	 128: 07ff 0000 0000 0000 0000 22a9 1434 68c0    .........."..4h.
	 144: e008 d7e8 cca4 af88 da90 2b45 dc13 57b9    ..........+E..W.
	 160: 3a0a e3f8 5a98 fddb 5039 62bc 1858 ecd5    :...Z...P9b..X..
	 176: 0f5c fcd6 a150 7bf0 0782 d337 8cf6 8de1    .\...P{....7....
	 192: 5e81 481f b921 9054 d74a 0160 e9a4 0522    ^.H..!.T.J.`..."
	 208: 8d85 f55d 9576 f819 6515 c010 8d22 d0a4    ...].v..e...."..
	 224: e685 0b00 ebd9 cb9b 4079 dcd1 1195 5690    ........@y....V.
	 240: 9d07 846b a8e0 f022 c33d 7412 5065 3bc5    ...k...".=t.Pe;.
	 256: 0be5 7f98 9cb5 f5cb 8452 aa0a dfa7 cfb3    .........R......
	 272: e9eb a607 03a8 59c9 dc62 903c b289 dd13    ......Y..b.<....
	 288: b20f 612d 1603 c335 2705 61ce af13 b792    ..a-...5'.a.....
	 304: 442e 5a19 59fb d867 377e 34f3 b43d f8e3    D.Z.Y..g7~4..=..
	 320: ff0a 2937 d04c 1b22 0213 5227 57f1 ba26    ..)7.L."..R'W..&
	 336: 44e0 5e52 2f79 41d9 a494 cee6 bd76 f8e0    D.^R/yA......v..
	 352: ecd1 4b98 0e91 7b09 321e 97b1 26ef 3cdc    ..K...{.2...&.<.
	 368: 7211 7ae3 b71c 3bb0 c1b0 2e91 93e2 2b37    r.z...;.......+7
	 384: a1de 76ca f736 70c4 4987 b39f 71e9 736f    ..v..6p.I...q.so
	 400: fc6e 433e 5f2f f283 06b6 cf1b 96f8 b447    .nC>_/.........G
	 416: af39 1d95 6fe7 4173 e554 2d77 c9b8 df88    .9..o.As.T-w....
	 432: 48d2 843e 67cb 54a2 93c8 8bad b24c 1e40    H..>g.T......L.@
	 448: 64aa 7f75 5fec a0c6 4d58 de19 ec68 25d3    d..u_...MX...h%.
	 464: af93 6f26 e12f 180b f0c0 87b6 7df6         ..o&./......}.

...

172.17.0.101 -> 172.17.0.201 NFS R CB_NULL

	   0: 0208 20e4 7813 0208 20ea 4c3d 0800 4500    .. .x... .L=..E.
	  16: 0050 ea7c 4000 4006 0000 ac11 0065 ac11    .P.|@.@......e..
	  32: 00c9 b385 ed12 c833 5144 9614 5a3c 8018    .......3QD..Z<..
	  48: 8026 5993 0000 0101 080a 07b7 8b1d 07b7    .&Y.............
	  64: 8b1a 8000 0018 627d 0f68 0000 0001 0000    ......b}.h......
	  80: 0000 0000 0000 0000 0000 0000 0000         ..............

172.17.0.201 -> 172.17.0.101 TCP D=45957 S=60690 Ack=3358806368 Seq=2517916220 Len=0 Win=32806 Options=<nop,nop,tstamp 129469213 129469213>

	   0: 0208 20ea 4c3d 0208 20e4 7813 0800 4500    .. .L=.. .x...E.
	  16: 0034 f58a 4000 4006 ebe8 ac11 00c9 ac11    .4..@.@.........
	  32: 0065 ed12 b385 9614 5a3c c833 5160 8010    .e......Z<.3Q`..
	  48: 8026 cd1f 0000 0101 080a 07b7 8b1d 07b7    .&..............
	  64: 8b1d                                       ..

172.17.0.101 -> 172.17.0.201 TCP D=2049 S=1023 Ack=757019588 Seq=1181196406 Len=0 Win=32806 Options=<nop,nop,tstamp 129469216 129469211>

	   0: 0208 20e4 7813 0208 20ea 4c3d 0800 4500    .. .x... .L=..E.
	  16: 0034 ea7d 4000 4006 0000 ac11 0065 ac11    .4.}@.@......e..
	  32: 00c9 03ff 0801 4667 a076 2d1f 33c4 8010    ......Fg.v-.3...
	  48: 8026 5977 0000 0101 080a 07b7 8b20 07b7    .&Yw......... ..
	  64: 8b1b                                       ..


root@host1:~#

And now using the clear text mount point.

root@host1:~# snoop -d net0 -r -x 0 host zfs1 &
[1] 21593
root@host1:~# Using device net0 (promiscuous mode)

root@host1:~# cat /net/zfs1/clear/fox.txt
The quick brown fox jumps over the lazy dog.
...

172.17.0.201 -> 172.17.0.101 NFS R 4 (read        ) NFS4_OK PUTFH NFS4_OK READ NFS4_OK (45 bytes) EOF

	   0: 0208 20ea 4c3d 0208 20e4 7813 0800 4500    .. .L=.. .x...E.
	  16: 00b0 f594 4000 4006 eb62 ac11 00c9 ac11    ....@.@..b......
	  32: 0065 0801 03ff 2d1f 3ba8 4667 a8d2 8018    .e....-.;.Fg....
	  48: 8026 f4c5 0000 0101 080a 07b7 9377 07b7    .&...........w..
	  64: 9377 8000 0078 917d 0f68 0000 0001 0000    .w...x.}.h......
	  80: 0000 0000 0000 0000 0000 0000 0000 0000    ................
	  96: 0000 0000 000c 7265 6164 2020 2020 2020    ......read     
	 112: 2020 0000 0002 0000 0016 0000 0000 0000      ..............
	 128: 0019 0000 0000 0000 0001 0000 002d 5468    .............-Th
	 144: 6520 7175 6963 6b20 6272 6f77 6e20 666f    e quick brown fo
	 160: 7820 6a75 6d70 7320 6f76 6572 2074 6865    x jumps over the
	 176: 206c 617a 7920 646f 672e 0a00 0000          lazy dog.....

...

172.17.0.101 -> 172.17.0.201 TCP D=2049 S=1023 Ack=757021992 Seq=1181198770 Len=0 Win=32806 Options=<nop,nop,tstamp 129471358 129471351>

	   0: 0208 20e4 7813 0208 20ea 4c3d 0800 4500    .. .x... .L=..E.
	  16: 0034 ea89 4000 4006 0000 ac11 0065 ac11    .4..@.@......e..
	  32: 00c9 03ff 0801 4667 a9b2 2d1f 3d28 8010    ......Fg..-.=(..
	  48: 8026 5977 0000 0101 080a 07b7 937e 07b7    .&Yw.........~..
	  64: 9377                                       .w


root@host1:~#

In both cases, because I let the automounter time out and a new mount was initiated each time, there are so many packets that it is hard to know which one is doing what. However, in the case of reading the file on /clear, the "quick brown fox" text is clearly visible. Your own tests and snoop output should make this difference very clear.

By default, the mounts use NFS version 4 (NFSv4). You can also mount requesting version 3; the results will be the same.

Additional NFS Client Configuration Options

root@host1:~# mount -o vers=3 zfs1:/secure /mnt
root@host1:~#

And as a reminder you can force mounts to use version 3 on either a client or a server using the sharectl(1M) command.

root@host1:~# sharectl get -p client_versmax nfs
client_versmax=4
root@host1:~#
root@host1:~# sharectl set -p client_versmax=3 nfs
root@host1:~# sharectl get -p client_versmax nfs
client_versmax=3
root@host1:~#
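
A similar control exists on the server side. As a minimal sketch (the property name comes from the same sharectl nfs group; I leave my client capped at version 3 here, since the next section relies on that), capping the server at NFS version 3 would look like this:

root@zfs1:~# sharectl set -p server_versmax=3 nfs
root@zfs1:~# sharectl get -p server_versmax nfs
server_versmax=3
root@zfs1:~#

To undo the client change, set client_versmax back to 4, the value shown before the change above.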

Summary and Next Step

This completes the Secure NFS setup. One option is to co-locate the KDC and NFS server. Either go to Combining KDC and NFS Server or back to the introduction.

Secure NFS: Step 4--Combining the KDC and NFS Server

Combining the KDC and NFS Server

When I asked my customer about their availability requirements, they stated that they only need a few NFS clients with encrypted traffic. They would like to keep the setup simple, and therefore combine the KDC and NFS server. They are using Oracle Solaris Cluster for availability, and by putting both services in a single Solaris Zone, they can meet their availability requirements with Oracle Solaris Cluster managing the Solaris Zone startup and failover.

So I looked into whether this is a good idea, and I was informed that this is fully supported and tested. The way to do this is to make the KDC a client of itself.

Making the KDC a Kerberos Client

root@kdc1:~# kclient -p /net/kdc1/share/krb5/kcprofile

Starting client setup

---------------------------------------------------

Setting up /etc/krb5/krb5.conf.

Copied /net/kdc1.steffentw.com/share/krb5/krb5.conf to /system/volatile/kclient/kclient-krb5conf.mmayyQ.
Obtaining TGT for kws/admin ...
Password for kws/admin@STEFFENTW.COM:
kinit:  no ktkt_warnd warning possible

nfs/kdc1.steffentw.com entry ADDED to KDC database.
nfs/kdc1.steffentw.com entry ADDED to keytab.

host/kdc1.steffentw.com entry already exists in KDC database.
host/kdc1.steffentw.com entry already present in keytab.
host/kdc1.steffentw.com entry ADDED to keytab.

---------------------------------------------------
Setup COMPLETE.

root@kdc1:~#
root@kdc1:~# klist -k
Keytab name: FILE:/etc/krb5/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   3 host/kdc1.steffentw.com@STEFFENTW.COM
   3 host/kdc1.steffentw.com@STEFFENTW.COM
   3 host/kdc1.steffentw.com@STEFFENTW.COM
   3 host/kdc1.steffentw.com@STEFFENTW.COM
   2 nfs/kdc1.steffentw.com@STEFFENTW.COM
   2 nfs/kdc1.steffentw.com@STEFFENTW.COM
   2 nfs/kdc1.steffentw.com@STEFFENTW.COM
   2 nfs/kdc1.steffentw.com@STEFFENTW.COM
root@kdc1:~#

Creating Secured NFS Share

Then create a new mount point and put some data into it.

root@kdc1:~# zfs create -o mountpoint=/secure -o share.nfs=on -o share.nfs.sec=krb5p rpool/secure
root@kdc1:~#
root@kdc1:~# share
rpool_share     /share  nfs     sec=sys,rw     
rpool_secure    /secure nfs     sec=krb5p,rw   
root@kdc1:~#
root@kdc1:~# cp /net/zfs1/secure/fox.txt /secure/
root@kdc1:~#
root@kdc1:~# cat /secure/fox.txt
The quick brown fox jumps over the lazy dog.
root@kdc1:~#

Back on the client, read the file on the KDC with snoop running to show the data is encrypted. And since the maximum client version was set to version 3 earlier, the snoop output shows version 3 as well.

root@host1:~# snoop -d net0 -r host kdc1 &
[1] 21825
root@host1:~# Using device net0 (promiscuous mode)

root@host1:~# cat /net/kdc1/secure/fox.txt
The quick brown fox jumps over the lazy dog.
root@host1:~# 172.17.0.101 -> 172.17.0.251 TCP D=2049 S=1022 Syn Seq=597683294 Len=0 Win=32804 Options=<mss 1460,sackOK,tstamp 129789256 0,nop,wscale 5>
172.17.0.251 -> 172.17.0.101 TCP D=1022 S=2049 Syn Ack=597683295 Seq=1916087307 Len=0 Win=32806 Options=<sackOK,tstamp 129789256 129789256,mss 1460,nop,wscale 5>
172.17.0.101 -> 172.17.0.251 TCP D=2049 S=1022 Ack=1916087308 Seq=597683295 Len=0 Win=32806 Options=<nop,nop,tstamp 129789256 129789256>
172.17.0.101 -> 172.17.0.251 RPC RPCSEC_GSS C NFS ver(3) proc(1) (data encrypted)
172.17.0.251 -> 172.17.0.101 TCP D=1022 S=2049 Ack=597683495 Seq=1916087308 Len=0 Win=32806 Options=<nop,nop,tstamp 129789257 129789257>
172.17.0.251 -> 172.17.0.101 RPC RPCSEC_GSS R NFS ver(3) proc(1) (data encrypted)
172.17.0.101 -> 172.17.0.251 TCP D=2049 S=1022 Ack=1916087520 Seq=597683495 Len=0 Win=32806 Options=<nop,nop,tstamp 129789259 129789259>
172.17.0.101 -> 172.17.0.251 RPC RPCSEC_GSS C NFS ver(3) proc(4) (data encrypted)
172.17.0.251 -> 172.17.0.101 TCP D=1022 S=2049 Ack=597683699 Seq=1916087520 Len=0 Win=32806 Options=<nop,nop,tstamp 129789259 129789259>
172.17.0.251 -> 172.17.0.101 RPC RPCSEC_GSS R NFS ver(3) proc(4) (data encrypted)
172.17.0.101 -> 172.17.0.251 TCP D=2049 S=1022 Ack=1916087740 Seq=597683699 Len=0 Win=32806 Options=<nop,nop,tstamp 129789259 129789259>
172.17.0.101 -> 172.17.0.251 RPC RPCSEC_GSS C NFS ver(3) proc(1) (data encrypted)
172.17.0.251 -> 172.17.0.101 RPC RPCSEC_GSS R NFS ver(3) proc(1) (data encrypted)
172.17.0.101 -> 172.17.0.251 TCP D=2049 S=1022 Ack=1916087952 Seq=597683899 Len=0 Win=32806 Options=<nop,nop,tstamp 129789266 129789259>

root@host1:~#

Summary and Next Step

That is everything, I hope. Here you can quickly go back to the introduction.

Monday Aug 08, 2011

Solaris 11 Express Network Tunables

Overview

For years I, and many others, have been tuning TCP, UDP, IP, and other aspects of the Solaris network stack with ndd(1M). The ndd command is documented; however, most of the tunables were really private implementation interfaces, subject to change, and in many cases lacked documentation. Also, ndd does not show the default values, nor the possible values or ranges.

That is changing with Solaris 11 Express. A new command ipadm(1M) allows persistent and temporary (with the -t option) setting of key tunable values. This is a major improvement over ndd, where it is customary to create an /etc/rc2.d/S69ndd or similar script to set the parameter on every reboot. Another benefit is that ipadm shows the default value and the values that the property can be set to.
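
For comparison, this is roughly what such a boot script looks like. This is a minimal sketch only; the S69ndd name comes from the example above, and the two tunables shown are just placeholders, not a recommendation:

#!/sbin/sh
# /etc/rc2.d/S69ndd -- re-apply ndd tunables at every boot,
# since ndd settings do not survive a reboot.
/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q 256
/usr/sbin/ndd -set /dev/ip ip_forward_src_routed 0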

The ipadm command has many features to configure the IP settings of interfaces. This blog entry focuses on how ipadm replaces ndd. Note that ipadm only supports the IP, TCP, UDP, SCTP, and ICMP protocols. Other protocols such as ipsecah and keysock still require the use of ndd.

Review of ndd

To get a list of all tunables for a specific protocol, an ndd -get operation is performed with "?" as the argument. For example, this is a way of listing all the TCP parameters.
root@Solaris11Express# ndd -get /dev/tcp \?
tcp_time_wait_interval         (read and write)
tcp_conn_req_max_q             (read and write)
tcp_conn_req_max_q0            (read and write)
tcp_conn_req_min               (read and write)
...
tcp_dev_flow_ctl               (read and write)
tcp_reass_timeout              (read and write)
tcp_extra_priv_ports_add       (write only)
tcp_extra_priv_ports_del       (write only)
tcp_extra_priv_ports           (read only)
tcp_1948_phrase                (write only)
tcp_listener_limit_conf        (read only)
tcp_listener_limit_conf_add    (write only)
tcp_listener_limit_conf_del    (write only)
To get the current value of a specific parameter, list the parameter as the argument for the driver, in this case /dev/tcp.
root@Solaris11Express# ndd -get /dev/tcp tcp_conn_req_max_q
128
And to set a parameter, follow it with a value.
root@Solaris11Express# ndd -set /dev/tcp tcp_conn_req_max_q 256
root@Solaris11Express# ndd -get /dev/tcp tcp_conn_req_max_q
256
And for my own benefit, I set it back to the original.
root@Solaris11Express# ndd -set /dev/tcp tcp_conn_req_max_q 128
root@Solaris11Express# ndd -get /dev/tcp tcp_conn_req_max_q
128

Using the ipadm *-prop Options

The ipadm(1M) manual page lists three sub-commands to manage TCP/IP protocol properties.
     ipadm set-prop [-t] -p prop=value[,...] protocol
     ipadm reset-prop [-t] -p prop protocol
     ipadm show-prop [[-c] -o field[,...]] [-p prop[,...]] [protocol]
To list all the properties for all the protocols as currently supported, I run ipadm with the show-prop sub-command.
root@Solaris11Express# ipadm show-prop
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
ipv4  forwarding            rw   off          --           off          on,off
ipv4  ttl                   rw   255          --           255          1-255
ipv6  forwarding            rw   off          --           off          on,off
ipv6  hoplimit              rw   255          --           255          1-255
ipv6  hostmodel             rw   weak         --           weak         strong,
                                                                        src-priority,
                                                                        weak
ipv4  hostmodel             rw   weak         --           weak         strong,
                                                                        src-priority,
                                                                        weak
icmp  recv_maxbuf           rw   8192         --           8192         4096-65536
icmp  send_maxbuf           rw   8192         --           8192         4096-65536
tcp   ecn                   rw   passive      --           passive      never,passive,
                                                                        active
tcp   extra_priv_ports      rw   2049,4045    --           2049,4045    1-65535
tcp   largest_anon_port     rw   65535        --           65535        1024-65535
tcp   recv_maxbuf           rw   128000       --           128000       2048-1073741824
tcp   sack                  rw   active       --           active       never,passive,
                                                                        active
tcp   send_maxbuf           rw   49152        --           49152        4096-1073741824
tcp   smallest_anon_port    rw   32768        --           32768        1024-65535
tcp   smallest_nonpriv_port rw   1024         --           1024         1024-32768
udp   extra_priv_ports      rw   2049,4045    --           2049,4045    1-65535
udp   largest_anon_port     rw   65535        --           65535        1024-65535
udp   recv_maxbuf           rw   57344        --           57344        128-1073741824
udp   send_maxbuf           rw   57344        --           57344        1024-1073741824
udp   smallest_anon_port    rw   32768        --           32768        1024-65535
udp   smallest_nonpriv_port rw   1024         --           1024         1024-32768
sctp  extra_priv_ports      rw   2049,4045    --           2049,4045    1-65535
sctp  largest_anon_port     rw   65535        --           65535        1024-65535
sctp  recv_maxbuf           rw   102400       --           102400       8192-1073741824
sctp  send_maxbuf           rw   102400       --           102400       8192-1073741824
sctp  smallest_anon_port    rw   32768        --           32768        1024-65535
sctp  smallest_nonpriv_port rw   1024         --           1024         1024-32768
The first column lists the protocols. Of note is that there are separate IPv4 and IPv6 listings. Per the specification, there is no ttl for IPv6, which is why ttl appears only as an IPv4 property. IPv6 calls it the hoplimit, which is more indicative of how the value is actually used.

Including a protocol as an argument lists only those properties.

root@Solaris11Express# ipadm show-prop tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   ecn                   rw   passive      --           passive      never,passive,
                                                                        active
tcp   extra_priv_ports      rw   2049,4045    --           2049,4045    1-65535
tcp   largest_anon_port     rw   65535        --           65535        1024-65535
tcp   recv_maxbuf           rw   128000       --           128000       2048-1073741824
tcp   sack                  rw   active       --           active       never,passive,
                                                                        active
tcp   send_maxbuf           rw   49152        --           49152        4096-1073741824
tcp   smallest_anon_port    rw   32768        --           32768        1024-65535
tcp   smallest_nonpriv_port rw   1024         --           1024         1024-32768
We see the current value, whether we can set it, its default value, and the possible values or range of values. Self-documenting. I like it!

To get a specific property, the -p option specifies which one to list.

root@Solaris11Express# ipadm show-prop -p send_maxbuf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   send_maxbuf           rw   49152        --           49152        4096-1073741824
Now to set a property to a specific value, use the format property=value.
root@Solaris11Express# ipadm set-prop -p send_maxbuf=4096 tcp

root@Solaris11Express# ipadm show-prop -p send_maxbuf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   send_maxbuf           rw   4096         4096         49152        4096-1073741824
The value of 4096 in the PERSISTENT column indicates this setting will be retained even after a reboot. To set the property only until the next reboot, use the -t option to set it temporarily.
root@Solaris11Express# ipadm set-prop -t -p send_maxbuf=4096 tcp

root@Solaris11Express# ipadm show-prop -p send_maxbuf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   send_maxbuf           rw   4096         --           49152        4096-1073741824
While it is certainly possible to set the value of a property back to its default explicitly, I like the option to reset it to the default. This is done with the reset-prop sub-command. Note that the PERSISTENT column has reverted to its original --.
root@Solaris11Express# ipadm reset-prop -p send_maxbuf tcp

root@Solaris11Express# ipadm show-prop -p send_maxbuf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   send_maxbuf           rw   49152        --           49152        4096-1073741824

What About All Those Other ndd Configuration Parameters?

The output of the show-prop operation above is very small compared to what those who use ndd are used to for even just one of the protocols. So what about all the other ndd parameters?

There are two options:

  • Continue to use ndd
  • Use a special parameter conversion of the ndd parameter with ipadm
The first is business as usual. The second involves converting the protocol's ndd parameter into one that works with ipadm. The steps that have worked for me are as follows.

    • For any parameter, replace the /dev/protocol and use the protocol as the protocol argument to ipadm. So /dev/tcp becomes tcp.
    • Drop the leading protocol name from the beginning of the parameter, if there is one. So tcp_local_dack_interval becomes _local_dack_interval.
    • If there is no leading protocol name, prepend the parameter with an underscore (_). For example, arp_probe_interval becomes _arp_probe_interval.
    • For the IP protocol, if there are IPv4 and IPv6 ndd values, indicate the ipadm protocol as ipv4 and ipv6, respectively. With ndd, the lack of a 6 means IPv4.
    Examples of each are as follows.

    Dropping the leading protocol name and specifying it for the protocol argument.

    root@Solaris11Express# ndd -get /dev/tcp tcp_local_dack_interval
    50
    
    root@Solaris11Express# ipadm show-prop -p _local_dack_interval tcp
    PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
    tcp   _local_dack_interval  rw   50           --           50           10-500
    
    Getting a parameter that does not start with the protocol.
    root@Solaris11Express# ndd -get /dev/ip arp_probe_interval
    1500
    
    root@Solaris11Express# ipadm show-prop -p _arp_probe_interval ip
    PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
    ip    _arp_probe_interval   rw   1500         --           1500         10-20000
    
    Distinguishing between IPv4 and IPv6 parameters.
    root@Solaris11Express# ndd -get /dev/ip ip_strict_dst_multihoming
    0
    root@Solaris11Express# ndd -get /dev/ip ip6_strict_dst_multihoming
    0
    
    root@Solaris11Express# ipadm show-prop -p _strict_dst_multihoming ipv4
    PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
    ipv4  _strict_dst_multihoming rw 0            --           0            0-1
    root@Solaris11Express# ipadm show-prop -p _strict_dst_multihoming ipv6
    PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
    ipv6  _strict_dst_multihoming rw 0            --           0            0-1
    
    And when there is an error, all the fields have ? in them.
    root@Solaris11Express# ipadm show-prop -p _strict_dst_multihoming ip
    PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
    ipadm: warning: cannot get property '_strict_dst_multihoming' for 'ip'Unknown property
    ip    _strict_dst_multihoming ?  ?            ?            ?            ?
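
    Setting one of these converted private properties works with set-prop just like a documented property. A minimal sketch, reusing _local_dack_interval from above with the -t option (the value 100 is chosen only for illustration; it lies within the 10-500 range shown by show-prop):

    root@Solaris11Express# ipadm set-prop -t -p _local_dack_interval=100 tcp

    root@Solaris11Express# ipadm show-prop -p _local_dack_interval tcp
    PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
    tcp   _local_dack_interval  rw   100          --           50           10-500

    root@Solaris11Express# ipadm reset-prop -t -p _local_dack_interval tcp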
    
    As more properties are added for ipadm to manage directly, it will become less necessary to use the ndd work-around.

    Wednesday Jun 08, 2011

    ZFS zpool and file system version numbers and features

    Often enough I have had to check the version of a ZFS pool or file system version. Sometimes, I am curious where a specific feature was delivered. So I imagine this could be useful for others. (Updated 21 Feb 2012 for Solaris 10 8/11 and Solaris 11.)

    One note is that ZFS versions are backward compatible, which means that a kernel with a newer version can import an older version. The reverse is not true. So it is important to know the oldest kernel version you might ever want to attach a pool to, and to make sure you don't upgrade your pool or file system to a version newer than that kernel supports. This table may help with that as well.
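
    A quick way to check before importing or upgrading is to ask for the version property. A minimal sketch, assuming a pool named tank with a file system tank/home:

    # zpool get version tank
    # zfs get version tank/home
    # zpool upgrade -v
    # zfs upgrade -v

    The two upgrade -v commands only list the versions the running software supports; they do not change anything.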

    Note: This table is sorted by pool version, then file system version. The availability dates of the releases are not chronological, as a feature delivered in a version of Solaris 11 may be delivered in a later Solaris 10 update.

    Each entry below lists the release a version was delivered in, the zpool version, the zfs version, the features, and any comments.
    Solaris 11 11/11 (zpool version 33, zfs version 5)
    • Encryption
    • Label support for Trusted Extensions
    Solaris 11 Express 2010.11 (zpool version 31, zfs version 5)
    • deduplication
    • diff for snapshots
    • read-only pool import
    • pool import with missing log device
    Solaris 10 8/11 (zpool version 29, zfs version 5)
    • ZFS installation with Flash Archives (not really a ZFS feature)
    • ZFS send will include file system properties
    • ZFS diff
    • Pool import with missing log device
    • Pool import as read-only
    • Synchronous writes
    • ACL improvements
    • Improvements in pool messages
    Solaris 10 9/10 (zpool version 22, zfs version 4)
    • triple parity RAID-Z (raidz3)
    • logbias property
    • pool recovery
    • mirror splitting
    • device replacement enhancements
    • ZFS system process
    Solaris 10 10/09 (zpool version 10, zfs version 3)
    • ZFS with flash installation
    • user and group quotas
    • ZFS cache devices (L2ARC)
    • set ZFS properties at file system creation
    • primarycache and secondarycache properties
    • log device recovery
    Solaris 10 5/09 (zpool version 10, zfs version 3)
    • zone clone creates ZFS clone
    Solaris 10 10/08 (zpool version 10, zfs version 3)
    • separate ZIL log devices
    • ZFS boot/root file system
    • zone on ZFS
    • recursive snapshot renaming
    • snapshot rollback improvements
    • snapshot send improvements
    • gzip compression
    • multiple user data copies

    • quotas and reservations can exclude snapshots/clones
    • failure mode options
    • ZFS upgrade option
    • delegated administration
    In Solaris 10 10/08 and later, zpool and zfs have the version option. It shows the version of the pool or file system, even if it is an older ZFS pool.
    Solaris 10 5/08 (zpool version 4, zfs version 1). Pool version determined using zdb(1M) on Solaris 10 5/08.
    Solaris 10 8/07 (zpool version 4, zfs version 1)
    • iSCSI support
    • zpool history
    • ability to set properties when creating file system
    Pool version determined using zdb(1M) on Solaris 10 8/07
    Solaris 10 11/06 (zpool version 3, zfs version 1)
    • recursive snapshots
    • double parity RAID-Z (raidz2)
    • clone promotion
    Pool version determined using zdb(1M) on Solaris 10 11/06
    Solaris 10 6/06 (zpool version 2, zfs version 1)
    • pool upgrade
    • restore of destroyed pool
    • integration into Solaris FMA
    • file system monitoring (fsstat)
    Initial release of ZFS in Solaris 10

    Pool version determined using zdb(1M) on Solaris 10 6/06

    The details of all the ZFS features introduced in the Solaris 10 updates are listed in Chapter 1 of the ZFS Administration Guide and for Solaris 11 Express in its ZFS Administration Guide.

    Hope this helps!

    Steffen

    Monday Apr 04, 2011

    Why Are Packets Going Out The Wrong Interface--Preserving For Historical Reasons

    I had previously referenced James Carlson's blog. Because the information is useful, and James is no longer with the company to update or preserve it, I am copying his posting here. Thanks again, James, for all the information regarding networking, and specifically Solaris networking, over the years!!

    Steffen

    Dated: Thursday Apr 30, 2009

    The Problem

    A common complaint for Solaris users runs something like this:

    • I have a Solaris system with two Ethernet interfaces connected to different subnets. Sometimes, I see an IP packet come in on one interface, but the packet goes back out a different one.
    • This behavior is bad for my network, because I have firewalls that check the packet sources, and they drop these misdirected packets. Why does Solaris do this? And how can I fix it? I've tried disabling routing, but that doesn't seem to help.
    Problems like this when reported are usually closed out as "will not fix," as for example CR 4085133.

    The Why

    The underlying problem here is at least partly a misunderstanding of how TCP/IP works. When a system transmits a packet, it must locate the "best" interface over which to send it. By default, the algorithm for doing that is as described in RFC 1122 section 3.3.1. Note in particular section 3.3.1.1. This requires the system to look at local interfaces first -- all of them -- to try to match the destination address. And once we find the interface by the destination address, we're done.

    That alone is enough to make things not work as expected. If you send a packet to the local address on ce0 from some other system, but that other system is best reachable through bge0, then we'll send the reply via bge0. It doesn't go back out through ce0, even if the original request came in that way.

    When considering a non-interface route (whether only the "default routes" of RFC 1122 or the more flexible CIDR routes of RFC 1812), the system will look up the route by destination IP address alone, and then use the route to obtain the output interface. This often causes the same sort of confusion when a "default route" ends up causing packets to go to the default router that the administrator thinks don't belong there.

    I actually consider this a design feature of TCP/IP, and not a flaw. It's part of the robustness that IP's datagram routing system offers: every node in the network -- hosts and routers alike -- independently determines the best way to send each distinct datagram based solely on the destination IP address. This allows for "healing" of broken networks, as the failure of one interface or router means that you can potentially still use a different (perhaps less preferred) one to send your message.

    There are some related bits of confusion in this area. For example, some programmers think that binding to a particular IP address means that the interface with that address is "bound" and all packets will go out that way. That's not correct. The system still uses the destination address to pick the output path for each individual IP packet, even if your socket is bound to an address on some particular interface. And, as long as you don't set the ip_strict_dst_multihoming ndd flag (it's not set by default), binding to an address doesn't mean that packets will only arrive on that corresponding interface. They can arrive on any interface in the system, as long as the IP address matches the one bound.

    The Solutions

    There are many ways to fix this issue, and the right answer for a given situation likely depends on the details of that situation.

    • The main issue here is the kernel's forwarding table, so putting the right things into the forwarding table is one of the first tasks.

      A common problem is that the administrator has set up a "default router," but that specified router cannot correctly forward to all possible IP destinations. Some packets the system sends end up getting misdirected or lost as a result. The solution is not having that router as a "default router," and instead using more specific routes (perhaps running a listen-only routing protocol to simplify the administrative burden).

    • Some systems have a "route by source address" feature. Solaris isn't one of those, though there is an RFE open on it (see CR 4777670). A better answer, in my opinion, would be to do something similar to what's suggested in CR 4173841. That would be, when we have multiple matching routes, to prefer a route that gives us an output interface in the same subnet as the source address.

      It's a simple tweak, and would at least fix the folks who have problems with default route selection. It would not fix the problems people with interfaces on separate subnets have, though.

    • Applications that care about interface selection can use IP_BOUND_IF or IP_PKTINFO to select the specific interface desired.

      See the ip(7P) man page on your system for details.

    • If all else fails, you can use IP Filter's fastroute/to keyword on an output interface to put packets right where you want them. You should be aware that when you do this, you're circumventing IP's routing features, which means that if there's an interface or path failure, you may cause connections to fail that didn't need to fail.

    Tuesday Nov 23, 2010

    Getting GDM to work on text Solaris 11 Express 2010.11 installs

    One of the features of Solaris 11 Express is to install into a ZFS pool, which allows updates to be easily managed using ZFS snapshots and clones. The LiveCD install, however, does not offer the option to save space for another ZFS pool. I prefer to have a separate pool for data, even on my single-disk laptop. The only way to do that, as far as I can tell, is to install using the text installer. One side effect of the text installer is that it does not install everything necessary to run a GUI desktop, which is very handy on a laptop.

    Thanks to some replies to an internal question I posted, there is a relatively easy way to add the necessary packages to allow GDM and related tools to work. I have used these steps several times, and this writeup describes them.

    The initial text based install put 494 packages on the system.

    Solaris 11 Express 2010.11# pkg list | wc -l
    495
    Solaris 11 Express 2010.11# pkg list | head
    NAME (PUBLISHER)                              VERSION         STATE      UFOXI
    SUNWcs                                        0.5.11-0.151.0.1 installed  -----
    SUNWcsd                                       0.5.11-0.151.0.1 installed  -----
    archiver/gnu-tar                              1.23-0.151.0.1  installed  -----
    compress/bzip2                                1.0.6-0.151.0.1 installed  -----
    compress/gzip                                 1.3.5-0.151.0.1 installed  -----
    compress/p7zip                                4.55-0.151.0.1  installed  -----
    compress/unzip                                5.53.7-0.151.0.1 installed  -----
    compress/zip                                  2.32-0.151.0.1  installed  -----
    consolidation/SunVTS/SunVTS-incorporation     0.5.11-0.151.0.1 installed  -----
    
    To add the required packages to the system, the slim_install package has to be added. This adds an additional 390 packages to the system.
    Solaris 11 Express 2010.11# pkg install slim_install
                   Packages to install:   390
               Create boot environment:    No
                   Services to restart:    10
    DOWNLOAD                                  PKGS       FILES    XFER (MB)
    Completed                              390/390 42204/42204  410.5/410.5
    
    PHASE                                        ACTIONS
    Install Phase                            67952/67952
    
    PHASE                                          ITEMS
    Package State Update Phase                   390/390
    Image State Update Phase                         2/2
    
    After this, I did a reboot, just to make sure. Then I uninstalled the slim_install package, which removed only that one. The other 389 packages must have been dependencies of slim_install.
    Solaris 11 Express 2010.11# pkg uninstall slim_install
                    Packages to remove:     1
               Create boot environment:    No
    PHASE                                        ACTIONS
    Removal Phase                                828/828
    
    PHASE                                          ITEMS
    Package State Update Phase                       1/1
    Package Cache Update Phase                       1/1
    Image State Update Phase                         2/2
    
    Once I enable GDM, the screen shows activity and shortly I have the familiar GUI login prompt.
    Solaris 11 Express 2010.11# svcs gdm
    STATE          STIME    FMRI
    disabled       12:26:40 svc:/application/graphical-login/gdm:default
    
    Solaris 11 Express 2010.11# svcadm enable gdm
    
    Solaris 11 Express 2010.11# svcs gdm
    STATE          STIME    FMRI
    online         12:38:11 svc:/application/graphical-login/gdm:default
    
    I hope this helps others. I certainly know where to look when I have to do this again!

    Steffen

    [Updated 2010.11.23]

    First, I'd like to acknowledge Keith Mitchell who provided me with the suggestion to do the install and uninstall of the slim_install package.

    Second, in the process of checking in with Keith, he suggested taking care when doing the above operations while logged in on the console. If you leave yourself logged in at the console when GDM starts, there are small possibilities of certain devices not being configured properly when logging into gnome, due to how logindevperm works. Suggestions include:

    svcadm enable gdm && exit
    
    or
    svcadm enable gdm; exit
    
    I did this remotely, at least the most recent time, to capture the output for this blog. I did not notice any effects when I had done this the first time on a different system; however, I might have rebooted at that point anyway.

    Thanks again to Keith for his tips!

    Monday Nov 15, 2010

    Solaris 11 Express 2010.11 is available!!

    Congratulations to everyone for getting Solaris 11 Express out! With this release come a lot of networking improvements and features, including the following:
    • Network Virtualization and Resource Management (Crossbow) with VNICs and flows
    • New IP administrative interface (ipadm)
    • IPMP rearchitecture
    • IP observability and data link statistics (dlstat)
    • Link protection when a data link is given to a guest VM
    • Layer 2 bridging
    • More device types supported by dladm, including WiFi
    • Network Automagic for automatic network selection on desktops
    • Improved socket interface for better performance
    Get more information here.

    More to follow in the future!

    Steffen

    Friday Oct 15, 2010

    New privilege added to the 'basic' Least Privilege set

    Oracle Solaris 10 9/10 (update 9) has added another privilege to the basic set of privileges, the set that all unprivileged (non-root) users have by default.

    With Least Privileges, a non-root process by default has the ability to get process information, create and delete files, fork and exec, and now separately open TCP or UDP end points. The ppriv(1) command prints the list of privileges.

    Solaris 10 9/10# ppriv -l basic
    file_link_any
    proc_exec
    proc_fork
    proc_info
    proc_session
    net_access
    
    A verbose listing includes basic descriptions, which are also described in privileges(5).

    Solaris 10 9/10# ppriv -lv basic
    file_link_any
           Allows a process to create hardlinks to files owned by a uid
           different from the process' effective uid.
    proc_exec
           Allows a process to call execve().
    proc_fork
           Allows a process to call fork1()/forkall()/vfork()
    proc_info
           Allows a process to examine the status of processes other
           than those it can send signals to.  Processes which cannot
           be examined cannot be seen in /proc and appear not to exist.
    proc_session
           Allows a process to send signals or trace processes outside its
           session.
    net_access
           Allows a process to open a TCP or UDP network endpoint.
    
    With the addition of the net_access privilege, it is now possible to prevent a process from creating sockets and network end points, isolating the process from the network. By default, processes have this privilege, so any action would be to remove it.

    To demonstrate this I am using the ppriv command to limit the privilege of a command and see with the debug flag what is happening.

    Even as an unprivileged user I can see if a specific IP address is in use with the ping command. So let's see what happens when I don't have the net_access privilege. I am doing this as a basic user.

    Solaris 10 9/10$ ppriv -D -s I-net_access -e /usr/sbin/ping 172.16.1.1
    ping[14942]: missing privilege "net_access" (euid = 1001, syscall = 5) 
       for "devpolicy" needed at spec_open+0xd0
    ping[14942]: missing privilege "net_access" (euid = 1001, syscall = 5) 
       for "devpolicy" needed at spec_open+0xd0
    ping[14942]: missing privilege "net_access" (euid = 1001, syscall = 5) 
       for "devpolicy" needed at spec_open+0xd0
    /usr/sbin/ping: unknown host 172.16.1.1
    
    Since I am forking a process with the -e option, I limit the I (inherited) privilege set with net_access removed. The debug output shows that it's net_access that is missing, and it happens three times.

    To see how it would look with the privilege, I run the same command with the basic set inherited.

    Solaris 10 9/10$ ppriv -D -s I=basic -e /usr/sbin/ping 172.16.1.1
    172.16.1.1 is alive 
    
    Everything worked, and no debug output.

    It's a good idea to use predefined sets such as basic, so that changes in the set don't affect scripts in the future.

    Steffen

    Thursday Jun 17, 2010

    TCP Fusion and improved loopback traffic

    In the past, when two processes were communicating using TCP on the same system, a lot of the TCP and IP protocol processing was performed just as it was for traffic to and from another system. A significant amount of CPU is spent in the protocol layers, on both the sending and receiving sides, to ensure delivery is successful, complete, in order, not duplicated, and re-routed around network failures, even for data that never leaves the system, so there is considerable performance benefit in providing a short circuit for that data.

    In Solaris 10 6/06 a feature called TCP Fusion was delivered, which removes all the stack processing when both ends of the TCP connection are in the same system, and now with IP Instances, in the same IP Instance (between the global zone and all shared IP zones, or within an exclusive zone). There are some exceptions, including when using IPsec, IPQoS, raw sockets, kernel SSL, or non-simple TCP/IP conditions, or when the two end points are on different squeues. A fused connection will revert to unfused if an IP Filter rule would drop a packet. However, TCP Fusion is done in the general case.

    So why do I bring this up? With TCP fusion enabled (which it is by default in Solaris 10 6/06 and later, and in OpenSolaris), when a TCP connection is created between processes on a system, the necessary things are set up to transfer data from the sender to the receiver without sending it down and back up the stack. The typical flow control of filling a send buffer (defaults to 48K or the value of tcp_xmit_hiwat, unless changed via a socket operation) still applies. With TCP Fusion on, there is a second check, which is the number of writes to the socket without a read. The reason for the counter is to allow the receiver to get CPU cycles, since the sender and receiver are on the same system and may be sharing one or more CPUs. The default value of this counter is eight (8), as determined by tcp_fusion_rcv_unread_min. The value per TCP connection is calculated as

    MAX(sndbuf >> 14, tcp_fusion_rcv_unread_min);
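
    For example, with the default 49152-byte (48K) send buffer, 49152 >> 14 is 3, so the counter works out to MAX(3, 8) = 8 unread writes.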
    
    Some details of the reasoning and implementation are in Change Request 4821256.

    When doing large writes, or when the receiver is actively reading, the buffer flow control dominates. However, when doing smaller writes, it is easy for the sender to end up with a condition where the number of consecutive writes without a read is exceeded, and the writer blocks, or if using non-blocking I/O, will get an EAGAIN error.

    The latter was a case at a customer of mine. An ISV application was reporting EAGAIN errors on a new installation, something that hadn't been seen before. More importantly, the ISV was also not seeing it elsewhere or in their test environment.

    After some investigation using DTrace, including reproduction on slightly different system configuration, it became clear that the sending application was getting the error after a burst of writes. The application has both local and remote (on other systems) receivers, and the EAGAIN errors were only happening on the local connection.

    I also saw that the application was repeatedly doing a pair of writes, one of 12 bytes and the second of 696 bytes. Thus it would be easy to hit the consecutive write counter before the write buffer is ever filled.

    To test this I suggested the customer change the tcp_fusion_rcv_unread_min on their running system using mdb(1). I suggested they increase the counter by a factor of four (4), just to be safe.

    # echo "tcp_fusion_rcv_unread_min/W 32" | mdb -kw
    tcp_fusion_rcv_unread_min:      0x8            =       0x20
    
    Here is how you check what the current value is.
    # echo "tcp_fusion_rcv_unread_min/D" | mdb -k
    tcp_fusion_rcv_unread_min:
    tcp_fusion_rcv_unread_min:      32
    
    After running several hours of tests, the EAGAIN error did not return.

    Since then I have suggested they set tcp_fusion_rcv_unread_min to 0, to turn the check off completely. This will allow the buffer size and total outstanding write data volume to determine whether the sender is blocked, as it is for remote connections. Since the mdb change is only good until the next reboot, I suggested the customer change the setting in /etc/system.

    * Set TCP Fusion to allow unlimited outstanding writes up to the TCP send buffer set by default or by the application.
    * The default value is 8.
    set ip:tcp_fusion_rcv_unread_min=0
    
    To turn TCP Fusion off altogether, something I have not tested, the variable do_tcp_fusion can be set from its default of 1 to 0.
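
    A minimal sketch of what that would look like, simply mirroring the mechanisms used above for tcp_fusion_rcv_unread_min (again, I have not tested this):

    # echo "do_tcp_fusion/W 0" | mdb -kw

    or, persistently, in /etc/system:

    * Turn TCP Fusion off entirely (default is 1).
    set ip:do_tcp_fusion=0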

    I hope this helps someone who might be trying to understand why errors, or perhaps lower than expected throughput, are being seen on local connections.

    And I would like to note that in OpenSolaris only the do_tcp_fusion setting is available. With the delivery of CR 6826274, the consecutive write counting has been removed. The TCP Fusion code has also been moved into its own file.

    Thanks to Jim Eggers, Jim Fiori, Jim Mauro, Anders Parsson, and Neil Putnam for their help as I was tracking all this stuff down!

    Steffen

    PS. After publishing, I wrote this DTrace script to show what the per connection outstanding write counter tcp_fuse_rcv_unread_hiwater is set to.

    # more tcp-fuse.d
    #!/usr/sbin/dtrace -qs
    
    fbt:ip:tcp_fuse_maxpsz_set:entry
    {
            self->tcp = (tcp_t *) arg0;
    }
    
    fbt:ip:tcp_fuse_maxpsz_set:return
    /self->tcp > 0/
    {
            this->peer = (tcp_t *) self->tcp->tcp_loopback_peer;
            this->hiwat = this->peer->tcp_fuse_rcv_unread_hiwater;
    
            printf("pid: %d tcp_fuse_rcv_unread_hiwater: %d \\n", pid, this->hiwat);
    
            self->tcp = 0;
            this->peer = 0;
            this->hiwat = 0;
    }
    

    Wednesday Feb 24, 2010

    Solaris 10 Zones and Networking -- Common Considerations

    As often happens, a customer question resulted in this write-up. The customer had to quickly consider how they would deploy a large number of zones on an M8000. They would be configuring up to twelve separate links for the different networks, and double that for IPMP. I wrote up the following. Thanks to Penny Cotten, Jim Eggers, Gordon Lythgoe, Peter Memishian, and Erik Nordmark for the feedback as I was preparing this. Also, you may see some of this in future documentation.

    Definitions

    • Datalink: An interface at Layer 2 of the OSI protocol stack, which is represented in a system as a STREAMS DLPI (v2) interface. Such an interface can be plumbed under protocol stacks such as TCP/IP. In the context of Solaris 10 Zones, datalinks are physical interfaces (e.g. e1000g0, bge1), aggregations (aggr3), or VLAN-tagged interfaces (e1000g111000 (VLAN tag 111 on e1000g0), bge111001, aggr111003). A datalink may also be referred to as a physical interface, such as when referring to a Network Interface Card (NIC). The datalink is the 'physical' property configured with the zone configuration tool zonecfg(1M).
    • Non-global Zone: A non-global zone is any zone, whether native or branded, that is configured, installed, and managed using the zonecfg(1M) and zoneadm(1M) commands in Solaris 10. A branded zone may be either Solaris 8 or Solaris 9.

    Zone network configuration: shared versus exclusive IP Instances

    Since Solaris 10 8/07, zone configurations can be either in the default shared IP Instance or exclusive IP Instance configuration.

    When configured as shared, zone networking includes the following characteristics.

    • All datalink and IP, TCP, UDP, SCTP, IPsec, etc. configuration is done in the global zone.
    • All zones share the network configuration settings, including datalink, IP, TCP, UDP, etc. This includes ndd(1M) settings.
    • All IP addresses, netmasks, and routes are set by the global zone and can not be altered in a non-global zone.
    • Non-global zones can not utilize DHCP (neither client nor server). There is a work-around that may allow a zone to be a DHCP server.
    • By default a privileged user in a non-global zone can not put a datalink into promiscuous mode, and thus can not run things like snoop(1M). Changing this requires adding the priv_net_raw privilege to the zone from the global zone, and also requires identifying which interface(s) to allow promiscuous mode on via the 'match' zonecfg parameter. Warning: This allows the non-global zone to send arbitrary packets on those interfaces.
    • IPMP configuration is managed in the global zone and applies to all zones using the datalinks in the IPMP group. A non-global zone configured with one datalink from an IPMP group will, in effect, use all the datalinks in that group as failover and failback occur. Non-global zones can use multiple IPMP groups. The zone must be configured with only one datalink from each IPMP group.
    • Only default routes apply to the non-global zones, as determined by the IP address(es) assigned to the zone. Non-default static routes are not supported to direct traffic leaving a non-global zone.
    • Multiple zones can share a datalink.
    When configured as exclusive, zone networking includes the following characteristics.
    • All network configuration can be done within the non-global zone (and can also be done indirectly from the global zone, via zlogin(1) or by editing the files in the non-global zone's root file system).
    • IP and above configurations can not be seen directly within the global zone (e.g. running ifconfig(1M) in the global zone will not show the details of a non-global zone).
    • The non-global zone's interface(s) can be configured via DHCP, and the zone can be a DHCP server.
    • A privileged user in the non-global zone can fully manipulate IP address, netmask, routes, ndd variables, logical interfaces, ARP cache, IPsec policy and keys, IP Filter, etc.
    • A privileged user in the non-global zone can put the assigned interface(s) into promiscuous mode (e.g. can run snoop).
    • The non-global zone can have unique IPsec properties.
    • IPMP must be managed within the non-global zone.
    • A datalink can only be used by a single running zone at any one time.
    • Commands such as snoop(1M) and dladm(1M) can be used on datalinks in use by running zones.
    It is possible to mix shared and exclusive IP zones on a system. All shared zones will be sharing the configuration and run time data (routes, ARP, IPsec) of the global zone. Each exclusive zone will have its own configuration and run time data, which can not be shared with the global zone or any other exclusive zones.
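
    As a minimal sketch of how that choice is expressed, an exclusive IP zone is configured by setting ip-type in zonecfg(1M); the zone name and datalink below are placeholders:

    global# zonecfg -z dbzone
    zonecfg:dbzone> set ip-type=exclusive
    zonecfg:dbzone> add net
    zonecfg:dbzone:net> set physical=e1000g2
    zonecfg:dbzone:net> end
    zonecfg:dbzone> commit
    zonecfg:dbzone> exit

    Leaving ip-type at its default of shared (and adding an address to the net resource) gives the shared IP behavior described above.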

    IP Multipathing (IPMP)

    By default, all IPMP configurations are managed in the global zone and affect all non-global zones whose network configuration includes even one datalink (the net->physical property in zonecfg(1M)) in the IPMP group. A zone configured with datalinks that are part of IPMP groups must configure each IP address on only one of the datalinks in the IPMP group. It is not necessary to configure an IP address on each datalink in the group. The global zone's IPMP infrastructure will manage the fail-over and fail-back of datalinks on behalf of all the shared IP non-global zones.

    For exclusive IP zones, the IPMP configuration for a zone must be managed from within the non-global zone, either via the configuration files or zlogin(1).

    The choice to use probe-based failure detection or link-based failure detection can be done on a per-IPMP group basis, and does not affect whether the zone can be configured as shared or exclusive IP Instance. Care must be taken when selecting test IP addresses, since they will be configured in the global zone and thus may affect routing for either the global or for the non-global zones.

    Routing and Zones

    The normal case for shared-IP zones is that they use the same datalinks and the same IP subnet prefixes as the global zone. In that case the routing in the shared-IP zones are the same as in the global zone. The global zone can use static or dynamic routing to populate its routing table, that will be used by all the shared-IP zones.

    In some cases different zones need different IP routing. The best approach to accomplish this is to make those zones be exclusive-IP zones. If this is not possible, then one can use some limited support for routing differentiation across shared-IP zones. This limited support only handles static default routes, and only works reliably when the shared-IP zones use disjoint IP subnets.

    All routing is managed by the zone that owns the IP Instance. The global zone owns the 'default' IP Instance that all shared IP zones use. Any exclusive IP zone manages the routes for just that zone. Different routing policies, routing daemons, and configurations can be used in each IP Instance.

    For shared IP zones, only default static routes are supported with those zones. If multiple default routes apply to a non-global zone, care must be taken that all the default routes are able to reach all the destinations that the zone needs to reach. A round robin policy is used when multiple default routes are available and a new route needs to be determined.

    The zonecfg(1M) 'defrouter' property can be used to define a default router for a specific shared IP zone. When a zone is started and the parameter is set, a default route on the interface configured for that zone will be created if it does not already exist. As of Solaris 10 10/09, when a zone stops, the default route is not deleted.
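
    Something like the following sets the property on a zone's net resource (the zone name, datalink, and addresses are only illustrative):
    global# zonecfg -z shared1
    zonecfg:shared1> add net
    zonecfg:shared1:net> set address=172.16.27.27/24
    zonecfg:shared1:net> set physical=e1000g0
    zonecfg:shared1:net> set defrouter=172.16.27.16
    zonecfg:shared1:net> end
    zonecfg:shared1> commit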

    Default routes on the same datalink and IP subnet are shared across non-global zones. If a non-global zone is on the same datalink and subnet as the global zone, default route(s) configured for one zone will apply for all other zones on that datalink and IP subnet.

    Inter-zone network traffic isolation

    There are several ways to restrict network traffic between non-global shared IP zones.
    • The /dev/ip ndd(1M) parameter 'ip_restrict_interzone_loopback', managed from the global zone, will force traffic out of the system on a datalink if the source and destination zones do not share a datalink. The default configuration allows inter-zone networking using internal loopback of IP datagrams, with the value of this parameter set to '0'. When the value is set to '1', traffic to an IP address in another zone in the shared IP Instance that is not on the same datalink will be put onto the external network. Whether the destination is reached will depend on the full network configuration of the system and the external network. This applies whether the source and destination IP addresses are on the same or different IP subnets. This parameter applies to all IP Instances active on the system, including exclusive IP Instance zones. In the case of exclusive IP zones, this will apply only if the zone has more than one datalink configured with IP addresses. For two zones on the same system to communicate with 'ip_restrict_interzone_loopback' set to '1', the following conditions must be met (a sample ndd(1M) and route(1M) invocation appears after this list).
      • There is a network path to the destination. If on the same subnet, the switch(es) must allow the connection. If on different subnets, routes must be in place for packets to pass reliably between the two zones.
      • The destination address is not on the same datalink (as this would break the datalink rules).
      • The destination is not on a datalink in an IPMP group that the sending datalink is also in.
      The 'ip_restrict_interzone_loopback' parameter is available in Solaris 10 8/07 and later.
    • A route(1M) action to prevent traffic between two IP addresses is available. Using the '-reject' flag will generate an ICMP unreachable when this route is attempted. The '-blackhole' flag will silently discard datagrams.
    • The IP Filter action 'intercept_loopback' will filter traffic between sockets on a system, including traffic between zones and loopback traffic within a zone. Using this action prevents traffic between shared IP zones. It does not force traffic out of the system using a datalink. More information is in the ipf.conf(4) or ipf(4) manual page.
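
    For the first two mechanisms, an invocation might look like this (the addresses are placeholders; ndd settings do not persist across a reboot unless also applied at boot time, and the two route entries reject or silently discard traffic to the given host, respectively):
    global# ndd -get /dev/ip ip_restrict_interzone_loopback
    global# ndd -set /dev/ip ip_restrict_interzone_loopback 1
    global# route add -host 192.168.51.43 127.0.0.1 -reject
    global# route add -host 192.168.51.44 127.0.0.1 -blackhole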

    Aggregations

    Solaris 10 1/06 and later support IEEE 802.3ad link aggregations using the dladm(1M) datalink administration command. Combining two or more datalinks into an aggregation effectively reduces the number of datalinks available. Thus it is important to consider the trade-offs between aggregations and IPMP when requiring either network availability or increased network bandwidth. Full traffic patterns must be understood as part of the decision making process.

    For the 'ce' NIC, Sun Trunking 1.3.1 is available for Solaris 10.

    Some considerations when making a decision between link aggregation and IPMP are the following.

    • Link aggregation requires support and configuration of aggregations on both ends of the link, i.e. both the system and the switch.
    • Most switches only support link aggregation within a switch, not spanning two or more switches.
    • Traffic between a single pair of IP addresses will typically only utilize one link in either an aggregation or IPMP group.
    • Link aggregation only provides availability between the switch ports and the system. IPMP using probe-based failure detection can redirect traffic around internal switch problems or network issues behind the switches.
    • Multiple hashing policies are available, and they can be set differently for inbound and outbound traffic (a dladm(1M) sketch appears after this list).
    • IPMP probe-based failure detection requires test addresses for each datalink in the IPMP group, which are in addition to the application or data address(es).
    • IPMP link-based failure detection will cause a fail-over or fail-back based on link state only. Solaris 10 supports configuring IPMP with only link-based failure detection. If IPMP is configured with probe-based failure detection, a link failure will also cause a fail-over, and a link restore will cause a fail-back.
    • A physical interface can be in only one aggregation. VLANs can be configured over an aggregation.
    • A datalink can be in only one IPMP group.
    • An IPMP group can use aggregations as the underlying datalinks.
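
    For example, the outbound hashing policy of an aggregation might be set along these lines (the key '1' and the policy values are only examples; the switch determines the inbound distribution):
    global# dladm create-aggr -P L3 -d bge1 -d bge2 1
    global# dladm modify-aggr -P L4 1
    global# dladm show-aggr 1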

    Note, this is for Solaris 10. OpenSolaris has differences. Maybe something for another day.

    I hope this is helpful! Steffen

    My thoughts on configuring zones with shared IP instances and the 'defrouter' parameter

    Occasionally I receive a call or email with questions about routing issues when using Solaris Zones in the (default) shared IP Instance configuration. Everything works well when the non-global zones are on the same IP subnet (let's say 172.16.1.0/24) as the global zone. Routing gets a little tricky when the non-global zones are on a different subnet.

    My general recommendation is to isolate. This means:

    • Separate subnets for the global zone (administration, backup) and the non-global zones (applications, data).
    • Separate data-links for the global and non-global zones.
      • The non-global zones can share a data-link
      • Non-global zones on different IP subnets use different data-links
    Using separate data-links is not always possible, and I was concerned whether sharing a data-link would actually work.

    So I did some testing, and exchanged some emails because of a comment I made regarding PSARC/2008/057 and the automatic removal of a default route when the zone is halted.

    Turns out I have been very restrictive in suggesting that the global and non-global zones not share a data-link. While I think that is a good administrative policy, to separate administrative and application traffic, it is not a requirement. It is OK to have the global zone and one or more non-global zones share the same data-link. However, if the non-global zones are to have different default routes, they must be on subnets that the global zone is not on.

    My test case running Solaris 10 10/09 has the global zone on the 129.154.53.0/24 network and the non-global zone on the 172.16.27.0/24 network.

    global# ifconfig -a
    ...
    e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.132 netmask ffffff00 broadcast 129.154.53.255
            ether 0:14:4f:ac:57:c4
    e1000g0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            zone shared1
            inet 172.16.27.27 netmask ffffff00 broadcast 172.16.27.255
    
    global# zonecfg -z shared1 info net
    net:
            address: 172.16.27.27/24
            physical: e1000g0
            defrouter: 172.16.27.16
    
    
    The routing tables as seen from both are:
    global# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              129.154.53.215       UG        1        123
    default              172.16.27.16         UG        1          7 e1000g0
    129.154.53.0         129.154.53.132       U         1         50 e1000g0
    224.0.0.0            129.154.53.132       U         1          0 e1000g0
    127.0.0.1            127.0.0.1            UH        3         80 lo0
    
    shared1# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              172.16.27.16         UG        1          7 e1000g0
    172.16.27.0          172.16.27.27         U         1          3 e1000g0:1
    224.0.0.0            172.16.27.27         U         1          0 e1000g0:1
    127.0.0.1            127.0.0.1            UH        4         78 lo0:1
    
    While the global zone shows both routes, only the default applying to its subnet will be used. And for traffic leaving the non-global zone, only its default will be used.

    You may notice that the Interface for the global zone's default router is blank. That is because I have set the default route via /etc/defaultrouter. I noticed that if it is determined via the route discovery daemon, it will be listed as being on e1000g0! This does not affect the behavior; however, it may be visually confusing, which is probably why I initially leaned towards saying to not share the data-link.
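
    In this case /etc/defaultrouter simply contains the router's address, matching the first default entry in the global zone's routing table above.
    global# cat /etc/defaultrouter
    129.154.53.215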

    There are multiple ways to determine which route might be used, including ping(1M) and traceroute(1M). I like the output of the route get command.

    global# route get 172.16.29.1
       route to: 172.16.29.1
    destination: default
           mask: default
        gateway: 129.154.53.1
      interface: e1000g0
          flags: <UP,GATEWAY,DONE,STATIC>
     recvpipe  sendpipe  ssthresh    rtt,ms rttvar,ms  hopcount      mtu     expire
           0         0         0         0         0         0      1500         0
    
    shared1# route get 172.16.28.1
       route to: 172.16.28.1
    destination: default
           mask: default
        gateway: 172.16.27.16
      interface: e1000g0:1
          flags: <UP,GATEWAY,DONE,STATIC>
     recvpipe  sendpipe  ssthresh    rtt,ms rttvar,ms  hopcount      mtu     expire
           0         0         0         0         0         0      1500         0
    
    This quickly shows which interfaces and IP addresses are being used. If there are multiple default routes, repeated invocations of this will show a rotation in the selection of the default routes.

    Thanks to Erik Nordmark and Penny Cotten for their insights on this topic!

    Steffen Weiberle

    Thursday Aug 20, 2009

    Why are packets going out of the "wrong" interface?

    I often refer to this blog by James Carlson, so to help others, and me, find it, here is Packets out of the wrong interface. Thanks James for all the help over the years!

    Steffen

    VLANs and Aggregations

    Every once in a while I see the question asking whether it is possible to use IEEE 802.1q VLANs together with IEEE 802.3ad Link Aggregation. I frequently have to check myself. So in order to better remind me, and share with others, here is a quick demonstration of how to get the two working together.

    My test system is running build 05 of the upcoming Solaris 10 10/09 (update 8). The system has four bge interfaces, and I will use numbers 1 and 2. (This should work just as well with previous updates of Solaris 10, and with Sun Trunking in Solaris 9, except for the zones parts. I am using zones just to isolate my traffic generation and easily get it to use a specific data link.)

    Starting out, things look like this.

    global# dladm show-dev
    bge0            link: up        speed: 1000  Mbps       duplex: full
    bge1            link: unknown   speed: 0     Mbps       duplex: unknown
    bge2            link: unknown   speed: 0     Mbps       duplex: unknown
    bge3            link: unknown   speed: 0     Mbps       duplex: unknown
    global# dladm show-link
    bge0            type: non-vlan  mtu: 1500       device: bge0
    bge1            type: non-vlan  mtu: 1500       device: bge1
    bge2            type: non-vlan  mtu: 1500       device: bge2
    bge3            type: non-vlan  mtu: 1500       device: bge3
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
            ether 0:3:ba:e3:42:8b
    
    I have my switch set up to aggregate ports 1 and 2, and here is how I do it with Solaris 10.
    global# dladm create-aggr -d bge1 -d bge2 1
    global# dladm show-link
    bge0            type: non-vlan  mtu: 1500       device: bge0
    bge1            type: non-vlan  mtu: 1500       device: bge1
    bge2            type: non-vlan  mtu: 1500       device: bge2
    bge3            type: non-vlan  mtu: 1500       device: bge3
    aggr1           type: non-vlan  mtu: 1500       aggregation: key 1
    
    VLAN tagged interfaces are used by accessing the underlying data link with the VLAN tag prepended to the data link's instance number. For bge1 and VLAN 111 that would be bge111001. For aggr1 it would be aggr111001.
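
    In other words, the VLAN link name is the VLAN ID times 1000 plus the instance number of the underlying link. A quick sanity check of the naming from the global zone, before handing the links to zones, might be to plumb and unplumb the VLAN link:
    global# ifconfig aggr111001 plumb     # 111 * 1000 + 1 = 111001
    global# ifconfig aggr111001 unplumb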

    For this setup I am using zones zone111 and zone112, each configured as an exclusive IP Instance zone. The zone configuration looks like this.

    global# zonecfg -z zone111 info
    zonename: zone111
    zonepath: /zones/zone111
    brand: native
    autoboot: false
    bootargs:
    pool:
    limitpriv:
    scheduling-class:
    ip-type: exclusive
    inherit-pkg-dir:
            dir: /lib
    inherit-pkg-dir:
            dir: /platform
    inherit-pkg-dir:
            dir: /sbin
    inherit-pkg-dir:
            dir: /usr
    net:
            address not specified
            physical: aggr111001
            defrouter not specified
    
    Once configured, installed, and booted, the network configuration of zone111 is:
    global# zlogin zone111 ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    aggr111001: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
            inet 172.16.111.141 netmask ffffff00 broadcast 172.16.111.255
            ether 0:3:ba:e3:42:8c
    
    Turns out that configuring this was easy compared to showing that the link aggregation was really working. While the full list of data links known to the system includes the aggregation and the VLANs on the aggregation, tools such as netstat or nicstat do not include them. As it turns out, they only report on interfaces that are plumbed up in that IP Instance. It will not be possible to plumb either bge1 or bge2, since they are members of the aggregation.
    global# dladm show-link
    bge0            type: non-vlan  mtu: 1500       device: bge0
    bge1            type: non-vlan  mtu: 1500       device: bge1
    bge2            type: non-vlan  mtu: 1500       device: bge2
    bge3            type: non-vlan  mtu: 1500       device: bge3
    aggr1           type: non-vlan  mtu: 1500       aggregation: key 1
    aggr111001      type: vlan 111  mtu: 1500       aggregation: key 1
    aggr112001      type: vlan 112  mtu: 1500       aggregation: key 1
    global# netstat -i
    Name  Mtu  Net/Dest      Address        Ipkts  Ierrs Opkts  Oerrs Collis Queue
    lo0   8232 loopback      localhost      98     0     98     0     0      0
    bge0  1500 pinebarren    pinebarren     43101  0     7181   0     0      0
    
    So I ended up using kstat(1M) to get the number of outbound packets. I am interested in outbound traffic, as that is what Solaris can affect when distributing traffic across links in an aggregation--the switch determines that for inbound traffic.

    This example shows data on instance 2 of the bge interface for kstat value opackets.

    global# kstat -m bge -i 2 -s opackets
    module: bge                             instance: 2
    name:   mac                             class:    net
            opackets                        2542
    
    With kstat I can see that for different connections either bge1 or bge2 has packets going out on it. A good test for me was scp to a remote system. Neither ping nor traceroute caused the necessary hashing to use both links in the aggregation.
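
    To watch the distribution over time, kstat(1M) also takes an interval, so something like this prints both links' counters every five seconds (the interval is arbitrary):
    global# kstat -p bge:1:mac:opackets bge:2:mac:opackets 5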

    Steffen

    Wednesday Jun 17, 2009

    ssh and friends scp, sftp say "hello crypto!"

    Solaris includes the SunSSH toolset (ssh, scp, and sftp) in Solaris 9 and later. Solaris 10 comes with the Solaris Cryptographic Framework that provides an easy mechanism for applications that use PKCS #11, OpenSSL, Java Security Extensions, or the NSS interface to take advantage of cryptographic hardware or software on the system.

    Separately, the UltraSPARC® T2 processor in the T-series (CMT) has built-in cryptographic processors (one per core, or typically eight per socket) that accelerate secure one-way hashes, public key session establishment, and private key bulk data transfers. The latter is useful for long standing connections and for larger data operations, such as a file transfer.

    Prior to Solaris 10 5/09, an scp or sftp file transfer operation had the encryption and decryption done by the CPU. While usually this is not a big deal, as most CPUs do private key crypto reasonably fast, on the CMT systems these operations are relatively slow. Now, with SunSSH with OpenSSL PKCS#11 Engine Support in 5/09, the SunSSH server and client will use the cryptographic framework when an UltraSPARC® T2 processor's n2cp cryptographic unit is available.

    To demonstrate this, I used a T5120 with Logical Domains (LDoms) 1.1 configured running Solaris 10 5/09. Using LDoms helps, as I can assign or remove crypto units on a per-LDom basis. (Since the crypto units are not supported yet with dynamic reconfiguration, a reboot of the LDom instance is required. However, in general, I don't see making that kind of change very often.)

    I did all the work in the 'primary' control and service LDom, where I have direct access to the network devices, and can see the LDom configuration. I am listing parts of it here, although this is about Solaris, SunSSH, and the crypto hardware.

    medford# ldm list-bindings primary
    NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
    primary          active     -n-cv-  SP      16    8G       0.1%  22h 16m
    
    MAC
        00:14:4f:ac:57:c4
    
    HOSTID
        0x84ac57c4
    
    VCPU
        VID    PID    UTIL STRAND
        0      0      0.6%   100%
        1      1      1.9%   100%
        2      2      0.0%   100%
        3      3      0.0%   100%
        4      4      0.0%   100%
        5      5      0.1%   100%
        6      6      0.0%   100%
        7      7      0.0%   100%
        8      8      0.7%   100%
        9      9      0.1%   100%
        10     10     0.0%   100%
        11     11     0.0%   100%
        12     12     0.0%   100%
        13     13     0.0%   100%
        14     14     0.0%   100%
        15     15     0.0%   100%
    
    MAU
        ID     CPUSET
        0      (0, 1, 2, 3, 4, 5, 6, 7)
        1      (8, 9, 10, 11, 12, 13, 14, 15)
    
    MEMORY
        RA               PA               SIZE
        0x8000000        0x8000000        8G
    
    The 'system' has 16 CPUs (hardware strands), two MAUs (those are the crypto units), and 8 GB of memory. I am using e1000g0 for the network and the remote system is a V210 running Solaris Express Community Edition snv_113 SPARC (OK, I am a little behind). The network is 1 GbE.
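
    One way to confirm that the crypto units are visible to the Solaris Cryptographic Framework is cryptoadm(1M); with the MAUs assigned to this domain I would expect the ncp and n2cp kernel hardware providers to show up in the listing.
    medford# cryptoadm list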

    The command I run is

    source#/usr/bin/time scp -i /.ssh/destination /large-file destination:/tmp
    
    source# du -h /large-file
     1.3G   /large-file
    
    My results with the crypto units were
    real     1:13.6
    user       32.2
    sys        34.5
    
    while without the crypto units
    real     2:28.2
    user     2:10.9
    sys        26.8
    
    The transfer took half the time and considerably less CPU processing with the crypto units in place (I have two, although I think it is using only one since this is a single transfer).

    So, SunSSH benefits from the built-in cryptographic hardware in the UltraSPARC® T2 processor!

    Steffen

    Monday Jun 01, 2009

    OpenSolaris 2009.06 Delivers Crossbow (Network Virtualization and Resource Control)

    Today OpenSolaris 2009.06, the third release of OpenSolaris, is announced and available for download. Among the many features in this version is the delivery of Project Crossbow, in a fully supported distribution. This brings network virtualization, including Virtual NICs (VNICs), bandwidth control and management, flow (QoS) creation and management, virtual switches, and other features to OpenSolaris.

    Network virtualization joins a number of other features already in OpenSolaris, such as vanity naming (allowing custom names for data links), snooping on loopback for better observability, a re-architected IPMP with an administrative interface, and Network Automagic (NWAM--automatic configuration of desktop networking based on available wired and wireless network services).
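
    To give a taste of the new administrative model (the link, VNIC, and flow names, as well as the bandwidth values, are made up), creating a bandwidth-limited VNIC and a flow might look something like this:
    # dladm create-vnic -l e1000g0 vnic1
    # dladm set-linkprop -p maxbw=100M vnic1
    # flowadm add-flow -l vnic1 -a transport=tcp,local_port=80 webflow
    # flowadm set-flowprop -p maxbw=50M webflow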

    Congratulations to everyone who made all this possible!

    Steffen

    PS: Regarding "fully supported", please notice the new support prices and durations!

    Thursday Apr 16, 2009

    Sun Shared Shell - A Cool Diagnostic Tool

    [Updated 2010.10.12 with new URL]

    As part of helping a customer out recently on an escalation, the SSE on the case suggested using Sun Shared Shell, a tool that allows you to see and optionally control a remote system. It supports SSH and Telnet.

    This tool was instrumental in increasing my understanding of what was going on with the customer's system, and removed the need to wait for output via emails or just trying to understand things over the phone. The owner of the session, usually the customer, has the option of allowing you to enter commands (without hitting 'Return'), or even allowing the 'Return' as well. It also has logging and chatting capabilities.

    When first logging in, it allows you to be the owner of the shell and share that with other participants, or to view someone else's shell session.

    Once logged in, you have a terminal window, the people present on the connection, and a chat window. The icon before the name/email address shows whether you have view, type, or full control (the keyboard will also have a down-arrow with it).

    Oh, and I forgot about the feature to scribble on the screen. I used that to diagram out an idea I had to solve a zone networking issue, and it helped the others understand what I was proposing a lot quicker!

    In the spirit of 'asking for what you want instead of complaining about what you don't have', I submitted a few suggestions, and the owner(s) quickly responded with clarifications.

    I see this as a great tool to help future cases where a shared view of operations will improve understanding or service delivery! Thanks to those who created and maintain it!

    Steffen

    What happened to my packets? -- or -- Dual default routes and shared IP zones

    I recently received a call from someone who has helped me out a lot on some performance issues (thanks, Jim Fiori), and I was glad to be able to return even a small part of those favors!

    He had been contacted to help a customer who was ready to deploy a web application, and they were experiencing intermittent lack of connection to the web site. Interestingly, they were also using zones, a bunch of them (OK, a handful)--and so right up my alley.

    The customer was running a multi-tiered web application on an x4600 (so Solaris on x86 as well!), with the web server, web router, and application tiers in different zones. They were using shared IP Instances, so all the network configuration was being done in the global zone.

    Initially, we had to modify some configuration parameters, especially regarding default routes. Since the system was installed with Solaris 10 5/08 and had more recent patches, we could use the defrouter feature introduced in 10/08 to make setting up routes for the non-global zones a little easier. This was needed because the global zone was using only one NIC, and it was not going to be on the networks that the non-global zones were on.

    What made the configuration a little unique was that the web server needs a default router to the Internet, while the application server needs a route to other systems behind a different router. Individually, everything is fine. However, the web1 zone also needs to be on the network that the application and web router are on, so it ends up having two interfaces.

    Let's look at web1 when only it is running.

    web1# ifconfig -a4
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 172.16.1.41 netmask ffffff00 broadcast 172.16.1.255
    bge2:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
            inet 192.168.51.41 netmask ffffff00 broadcast 192.168.51.255
    web1# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              172.16.1.1           UG        1          0 bge1
    172.16.1.0           172.16.1.41          U         1          0 bge1:1
    192.168.51.0         192.168.51.41        U         1          0 bge2:1
    224.0.0.0            172.16.1.41          U         1          0 bge1:1
    127.0.0.1            127.0.0.1            UH        5         34 lo0:1
    

    The zone is on two interfaces, bge1 and bge2, and has a default route that uses bge1. However, when zone app1 is running, there is a second default route, on bge2. The same is true if app2 or odr are running. Note that these three zones are only on bge2.

    app1# ifconfig -a4
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    bge2:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
            inet 192.168.51.43 netmask ffffff00 broadcast 192.168.51.255
    app1# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    default              192.168.51.1         UG        1          0 bge2
    192.168.51.0         192.168.51.43        U         1          0 bge2:1
    224.0.0.0            192.168.51.43        U         1          0 bge2:1
    127.0.0.1            127.0.0.1            UH        3         51 lo0:1
    

    In the meantime, this is what happens in web1.

    web1# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- --------- 
    default              192.168.51.1         UG        1          0 bge2
    default              172.16.1.1           UG        1          0 bge1 
    172.16.1.0           172.16.1.41          U         1          0 bge1:1
    192.168.51.0         192.168.51.41        U         1          0 bge2:4
    224.0.0.0            172.16.1.41          U         1          0 bge1:1
    127.0.0.1            127.0.0.1            UH        6        132 lo0:4
    

    With any of the other zones running, web1 now has two default routes. And it only happens in web1, as it is the only zone that has both a public facing data link (bge1) and a shared data link (bge2).

    Traffic to any system on either the 192.168.51.0 or 172.16.1.0 network will have no issues. Every time IP needs to determine a new path for a system not on either of those two networks, it will pick a route, and it will round-robin between the two default routes. Thus approximately half the time, connections will fail to establish, or possibly existing connections will not work if they have been idle for a while.

    This is how IP is supposed to work, so there is technically nothing wrong. It is a feature of zones and a shared IP Instance. [2009.06.23: For background on why IP works this way, see James' blog].

    The only problem is that this is not what the customer wants!

    One option would be to force all traffic between the web and application tier out the bge1 interface, putting it on the wire. This may not be desirable for security reasons, and introduces latencies since traffic now goes on the wire. Another option would be to use exclusive IP Instances for the web servers. For each web zone, and this example only has one, that would require two additional data links (NICs). That would add up. Also, this configuration is targeted to be used with Solaris Cluster's scalable services, and those must be in shared IP Instance zones. Hummm....as I like to say.

    We didn't know about the shared IP Instance restriction of Solaris Cluster, and as the customer was considering how they were going to add additional NICs to all the systems, something slowly developed in my mind. How about creating a shared, dummy network between the web and application tier? They had one spare NIC, and with shared IP it does not even need to be connected to a switch port, since IP will loop all traffic back anyway!

    The more I thought about it, the more I liked it, and I could not see anything wrong with it. At least not technically as I understood Solaris. Operationally, for the customer, it might be a little awkward.

    Here is what I was thinking of...

    With this configuration the web1 zone has a default router only to the Internet and it can reach odr, and if necessary, app1 and app2, directly via the new network. And app1 and app2 only have a single default route to get to the Intranet. The nice thing is that bge3 does not even need to be up. That is visible with ifconfig output, where bge3 is not showing a RUNNING flag, which indicates the port is not connected (or in my case has been disabled on the switch).
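
    A sketch of web1's additional net resource might look like this (the address matches the routing table below; odr, app1, and app2 would get their own addresses on the 192.168.52.0/24 subnet, with no defrouter set):
    global# zonecfg -z web1
    zonecfg:web1> add net
    zonecfg:web1:net> set physical=bge3
    zonecfg:web1:net> set address=192.168.52.41/24
    zonecfg:web1:net> end
    zonecfg:web1> commit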

    global# ifconfig -a4
    ...
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 0.0.0.0 netmask 0
            ether 0:3:ba:e3:42:8c
    bge2: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
            inet 0.0.0.0 netmask 0
            ether 0:3:ba:e3:42:8d 
    bge3: flags=1000802<BROADCAST,MULTICAST,IPv4> mtu 1500 index 5 
            inet 0.0.0.0 netmask 0
            ether 0:3:ba:e3:42:8e
    ...
    
    And within web1 there is now only one default route.
    web1# netstat -rn
    
    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- --------- 
    default              172.16.1.1           UG        1         17 bge1 
    172.16.1.0           172.16.1.41          U         1          2 bge1:1
    192.168.52.0         192.168.52.41        U         1          2 bge3:1
    224.0.0.0            172.16.1.41          U         1          0 bge1:1
    127.0.0.1            127.0.0.1            UH        4        120 lo0:1
    
    In the customer's case, multiple systems were being used, so the private networks were connected together so that a web zone on one system could access an odr zone on another. I am showing the simple, single system case since it is so convenient.

    If I were using Solaris Express Community Edition (SX-CE) or OpenSolaris 2009.06 Developer Builds, with the Crossbow bits and virtual NICs (VNICs) available, I wouldn't even have needed to use that physical interface. Both are available here.

    I hope this trick might help others out in the future.

    Steffen

    Tuesday Apr 14, 2009

    Using IPMP with link based failure detection

    Solaris has had a feature to increase network availability called IP Multipathing (IPMP). Initially it required a test address on every data link in an IPMP group, where the test addresses were used as the source IP address to probe network elements for path availability. One of the benefits of probe-based failure detection is that it can extend beyond the directly connected link(s), and verify paths through the attached switch(es) to what typically is a router or other redundant element to provide available services.

    Having one IP address (whether public or private, non-routable) per data link, in addition to the separate address(es) for the application(s), turns out to be a lot of addresses to allocate and administer. And since the default of five probes spaced two seconds apart means a failure takes at least ten (10) seconds to be detected, something more was needed.

    So in the Solaris 9 timeframe the ability to also do link based failure detection was delivered. It requires specific NICs whose driver has the ability to notify the system that a link has failed. The Introduction to IPMP in the Solaris 10 Systems Administrators Guide on IP Services lists the NICs that support link state notification. Solaris 10 supports configuring IPMP with only link based failure detection.

    global# more /etc/hostname.bge[12]
    ::::::::::::::
    /etc/hostname.bge1
    ::::::::::::::
    10.1.14.140/26 group ipmp1 up
    ::::::::::::::
    /etc/hostname.bge2
    ::::::::::::::
    group ipmp1 standby up
    
    On system boot, there will be an indication on the console that since no test addresses are defined, probe-based failure detection is disabled.

    Apr 10 10:57:20 in.mpathd[168]: No test address configured on interface bge2; disabling probe-based failure detection on it
    Apr 10 10:57:20 in.mpathd[168]: No test address configured on interface bge1; disabling probe-based failure detection on it
    
    Looking at the interfaces configured,
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 10.1.14.140 netmask ffffffc0 broadcast 10.1.14.191
            groupname ipmp1
            ether 0:3:ba:e3:42:8c
    bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
    bge2: flags=69000842<BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 0 index 4
            inet 0.0.0.0 netmask 0
            groupname ipmp1
            ether 0:3:ba:e3:42:8d
    
    you will notice that two of the three interfaces have no address (0.0.0.0). Also, the data address is on a physical interface on bge1. At the same time bge2 has the 0.0.0.0 address. On the failure of bge1,
    Apr 10 14:34:53 global bge: NOTICE: bge1: link down
    Apr 10 14:34:53 global in.mpathd[168]: The link has gone down on bge1
    Apr 10 14:34:53 global in.mpathd[168]: NIC failure detected on bge1 of group ipmp1
    Apr 10 14:34:53 global in.mpathd[168]: Successfully failed over from NIC bge1 to NIC bge2
    
    
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=19000802<BROADCAST,MULTICAST,IPv4,NOFAILOVER,FAILED> mtu 0 index 3
            inet 0.0.0.0 netmask 0
            groupname ipmp1
            ether 0:3:ba:e3:42:8c
    bge2: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname ipmp1
            ether 0:3:ba:e3:42:8d
    bge2:1: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
            inet 10.1.14.140 netmask ffffffc0 broadcast 10.1.14.191
    
    the data address is migrated onto bge2:1. I find this a little confusing. However, I don't know any way around it on Solaris 10. The IPMP Re-architecture makes this a lot easier!

    Using Link-based IPMP with non-global zones

    Configuring a shared IP Instance non-global zone and utilizing IPMP managed in the global zone is very easy.

    The IPMP configuration is very simple. Interface bge1 is active, and bge2 is in stand-by mode.

    global# more /etc/hostname.bge[12]
    ::::::::::::::
    /etc/hostname.bge1
    ::::::::::::::
    group ipmp1 up
    ::::::::::::::
    /etc/hostname.bge2
    ::::::::::::::
    group ipmp1 standby up
    
    My zone configuration is:
    global# zonecfg -z zone1 info
    zonename: zone1
    zonepath: /zones/zone1
    brand: native
    autoboot: false
    bootargs:
    pool:
    limitpriv:
    scheduling-class:
    ip-type: shared
    inherit-pkg-dir:
            dir: /lib
    inherit-pkg-dir:
            dir: /platform
    inherit-pkg-dir:
            dir: /sbin
    inherit-pkg-dir:
            dir: /usr
    net:
            address: 10.1.14.141/26
            physical: bge1
    
    Prior to booting, the network configuration is:
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone zone1
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname ipmp1
            ether 0:3:ba:e3:42:8c
    bge2: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname ipmp1
            ether 0:3:ba:e3:42:8d
    
    After booting, the network looks like this:
    global# ifconfig -a4
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            inet 127.0.0.1 netmask ff000000
    lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
            zone zone1
            inet 127.0.0.1 netmask ff000000
    bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
            inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
            ether 0:3:ba:e3:42:8b
    bge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname ipmp1
            ether 0:3:ba:e3:42:8c
    bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
            zone zone1
            inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
    bge2: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
            inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
            groupname ipmp1
            ether 0:3:ba:e3:42:8d
    

    So a simple case for the use of IPMP, without the need for test addresses! Other IPMP configurations, such as more than two data links, or active-active, are also supported with link based failure detection. The more links involved, the more test addresses are saved with link based failure detection. Since writing this entry I was involved in a customer configuration where this is saving several hundred IP addresses and their management (such as avoiding duplicate addresses). That customer is willing to forgo the benefit of probes testing past the local switch port.

    Steffen
