Friday Jan 15, 2010

Solaris 10 & OpenSolaris p2p, p2v, v2p

Solaris has always been one of the easier OSes to move between machines. When we would try to boot Solaris on a newly prototyped SPARC machine (back in the day when I worked in the SPARC H/W org), we would install a Solaris image on an older SPARC machine, and then fix it up so that it could boot on the prototype. Once OBP could load unix, it usually took us a day or two to work through the H/W, FPGA, OBP, and Solaris bugs so we could get to multiuser login.

Now that we have zfs root, it's more difficult, and at the same time, much more powerful. zfs stores platform specific boot information in it's meta data which isn't easily accessed, making it more difficult. But, zfs supports live snapshots which makes it much more powerful.

With zfs, we can "easily move" the machine from one "machine" to another. This generally applies to S10 and Opensolaris, as long as it's running a zfs root.. You can go from a physical machine, VirtualBox guest, xVM guest, etc. to a different physical machine, VirtualBox guest, xVM guest, etc.

For an example, I thought I would share some tricks on how you can transform a x86 box running OpenSolaris to a VirtualBox guest without ever shutting down or rebooting the x86 box.

The first thing you want to do is boot a new VirtualBox guest with an OpenSolaris live install iso. You want to make sure that the zfs version in the install iso matches the zfs version on your x86 box.

Once you have booted the install iso, open a shell. Enable ssh, and then run format. Write down the disk your going to use (e.g. c4t0d0s0), then run fdisk from within format. Usually you will create a single disk partition for Solaris here.. Exit fdisk saving your changes.

Now partition your Solaris disk. Select the 0 partition, set the tag to "root" (without the quotes), and set the range from 1 to the last cylinder. Make sure partition 8 is set to 0 - 0 cylinders, then label the disk and exit format.

On your x86 box, write down your hostname and hostid info.

	: core2[1]#; hostname
        core2
	: core2[1]#; hostid
	05bdb9c2
	: core2[1]#; echo "hw_serial,0xa?B" | mdb -k
	hw_serial:
	hw_serial: 39 36 33 31 39 39 33 38 0 0
On your VirtualBox guest, update hostname and hostid to match your x86 system.
	root@opensolaris:~# hostname core2
	root@core2:~# hostid
	00041f55
	root@core2:~# echo "hw_serial/v 39 36 33 31 39 39 33 38 0 0" | mdb -kw
	root@core2:~# hostid
	05bdb9c2
	root@core2:~# 
Now, create the zfs root on the VirtualBox guest using the disk you saw in format. Make sure that you create the zpool on slice 0 (s0). If your moving multiple pools, you'll probably want to setup multiple pools on the guest now too.
   zpool create -R /a -f rpool /dev/dsk/c4t0d0s0
Back to the x86 system, snapshot the root pool, then send it to the opensolaris guest (which is 192.168.0.117 in my example). Do the same for all pools you want to move.
   zfs snapshot -r rpool@p2v
   zfs send -R rpool@p2v | ssh jack@192.168.0.117 pfexec /usr/sbin/zfs receive -dF rpool
Once this completes, if the x86 system is actively being used, you'll want to shut down the apps your using (e.g. databases, etc.), take a snapshot again, then do a differencing zfs send to do a final sync.

Back on the VirtualBox guest, lets finalize the disk. Set bootfs to the BE you want to boot.

   zpool set bootfs='rpool/ROOT/--your-bootfs--' rpool
On the VirtualBox guest, install grub
   /a/sbin/installgrub /a/boot/grub/stage1 /a/boot/grub/stage2 /dev/rdsk/c4t0d0s0
If your NICs are different, you need to update them. If you have hostname.--nic-- and dhcp.--nic-- files, update them to point to your new NIC(s). You may have neither. Or you may only have a hostname.--nic--. You may also have to update /a/etc/nwam/llp. For multiple BEs, don't forget to update the NICs in all the BEs you want to boot.
   devfsadm -r /a -i e1000g
   mv /a/etc/hostname.--oldnic-- /a/etc/hostname.e1000g0
   mv /a/etc/hostname.--oldnic-- /a/etc/dhcp.e1000g0
If your using the same IP, you may want to take your x86 box off the net now... Eject the CDROM and reboot your VirtualBox guest... Hopefully it booted right up :-)

Thursday Dec 03, 2009

Updated create-be

I've fixed a few bugs in the create-be script. The scripts lets you create a new (non COW) OpenSolaris BE on a nevada zfs root based system or OpenSolaris system. You can use this to transition a Nevada zfs root based system to OpenSolaris. You can also choose an arbitrary OpenSolaris build (i.e. if you want to downgrade).

DISCLAIMER: this is totally unsupported by Sun, could mess up your system, etc. etc.

You can grab an updated copy here.

create-be --build=128a --bename=osol128 --repo=http://pkg.opensolaris.org/dev

Monday Aug 17, 2009

How to do a fresh install into a BE on OpenSolaris

Lately, I've been playing around with smaller OpenSolaris guests. Not wanting to muck around with setting up an AI installer, I ended up doing a "fresh install" into a new Boot Environment (BE). Although none of this is overly practical, it is interesting.. I thought I would share what I learned..

This works just as well on metal, in a VirtualBox OpenSolaris guest, and in a xVM OpenSolaris guest. For this example, I'll run through an xVM OpenSolaris guest. You'll need to change your package list slightly for the other ones.

First, a little off subject, lets install a fresh OpenSolaris guest.. This assumes your already running a OpenSolaris dom0.

: core2[1]#; virt-install -n opensolaris -r 1024  -p --nographics \\
--noautoconsole -l /net/192.168.0.71/tank/isos/solaris/os2009.06.iso \\
-f /vdisks/opensolaris -s 20

Starting install...
Retrieving file unix...   100% |=========================| 1.4 MB    00:00
Retrieving file x86.micro 100% |=========================|  36 MB    00:01
Creating storage file...  100% |=========================|   20 B    00:00
Creating domain...                                                 0 B 00:05
Domain installation still in progress. You can reconnect to
the console to complete the installation process.
: core2[1]#; virsh console opensolaris
v3.3.2-xvm chgset 'Wed Aug 12 17:12:49 2009 -0700 18433:bd9f134b1e1b'
SunOS Release 5.11 Version snv_111b 32-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: opensolaris
Remounting root read/write
Probing for device nodes ...
Preparing live image for use
Done mounting Live image
USB keyboard
 1. Albanian                      23. Lithuanian
 2. Belarusian                    24. Latvian
 3. Belgian                       25. Macedonian
 4. Brazilian                     26. Malta_UK
 5. Bulgarian                     27. Malta_US
 6. Canadian-Bilingual            28. Norwegian
 7. Croatian                      29. Polish
 8. Czech                         30. Portuguese
 9. Danish                        31. Russian
10. Dutch                         32. Serbia-And-Montenegro
11. Finnish                       33. Slovenian
12. French                        34. Slovakian
13. French-Canadian               35. Spanish
14. Hungarian                     36. Swedish
15. German                        37. Swiss-French
16. Greek                         38. Swiss-German
17. Icelandic                     39. Traditional-Chinese
18. Italian                       40. TurkishQ
19. Japanese-type6                41. TurkishF
20. Japanese                      42. UK-English
21. Korean                        43. US-English
22. Latin-American
To select the keyboard layout, enter a number [default 43]:

 1. Arabic
 2. Chinese - Simplified
 3. Chinese - Traditional
 4. Czech
 5. Dutch
 6. English
 7. French
 8. German
 9. Greek
10. Hebrew
11. Hungarian
12. Indonesian
13. Italian
14. Japanese
15. Korean
16. Polish
17. Portuguese - Brazil
18. Russian
19. Slovak
20. Spanish
21. Swedish
To select desktop language, enter a number [default is 6]:
User selected: English
Configuring devices.
Mounting cdroms
Reading ZFS config: done.

opensolaris console login: jack
Password:
Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008
jack@opensolaris:~$
jack@opensolaris:~$ Aug 17 02:58:54 opensolaris in.routed[696]: route 0.0.0.0/8 --> 0.0.0.0 nexthop is not directly connected

jack@opensolaris:~$ ifconfig xnf0
xnf0: flags=1004843 mtu 1500 index 2
        inet 192.168.0.147 netmask ffffff00 broadcast 192.168.0.255
jack@opensolaris:~$
jack@opensolaris:~$ 
(\^] to exit console)
: core2[1]#; /usr/lib/xen/bin/xenstore-ls | grep passwd
     passwd = "EJzbFnyg"
: core2[1]#;
Connect a vncviewer to the OpenSolaris guest, using the vnc password above.
: core2[1]#; vncviewer 192.168.0.147:0 &> /dev/null &
Once you've completed the OpenSolaris install, lets disable gdm and intrd (I like to do this for small guests), and apply the workaround for a zfs bug (6840704 osol_0906 PV guests sometimes hang at login prompt). Then lets create a working BE. If we mess anything up, we can always go back to where we started from.
: core2[1]#; virsh start opensolaris;virsh console opensolaris
Domain opensolaris started

v3.3.2-xvm chgset 'Wed Aug 12 17:12:49 2009 -0700 18433:bd9f134b1e1b'
SunOS Release 5.11 Version snv_111b 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: opensolaris
Configuring devices.
Loading smf(5) service descriptions: 150/150
svccfg import warnings. See /var/svc/log/system-manifest-import:default.log .
Reading ZFS config: done.
Mounting ZFS filesystems: (6/6)
Creating new rsa public/private host key pair
Creating new dsa public/private host key pair

opensolaris console login: myuser
Password:
Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008
myuser@opensolaris:~$ pfexec su -
Aug 17 10:58:54 opensolaris su: 'su root' succeeded for myuser on /dev/console
Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008
root@opensolaris:~# svcadm disable gdm
root@opensolaris:~# svcadm disable intrd
root@opensolaris:~# echo -e "\\n"\\
"forceload: drv/domcaps\\n"\\
"forceload: drv/xencons\\n"\\
"forceload: drv/xenbus\\n"\\
"forceload: drv/balloon\\n"\\
"forceload: drv/evtchn\\n"\\
"forceload: drv/privcmd\\n"\\
"forceload: drv/xdf\\n"\\
"forceload: drv/xnf\\n\\n" >> /etc/system 
root@opensolaris:~# bootadm update-archive
updating //platform/i86pc/boot_archive
updating //platform/i86pc/amd64/boot_archive
root@opensolaris:~# beadm create snv111b
root@opensolaris:~# beadm activate snv111b
root@opensolaris:~# reboot
Aug 17 11:36:01 opensolaris reboot: initiated by myuser on /dev/console
syncing file systems... done
rebooting...
v3.3.2-xvm chgset 'Wed Aug 12 17:12:49 2009 -0700 18433:bd9f134b1e1b'
SunOS Release 5.11 Version snv_111b 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: opensolaris
Reading ZFS config: done.
Mounting ZFS filesystems: (6/6)

opensolaris console login:
Now lets get to the point of this blog entry. Lets do a fresh install to a new BE.

When you do a beadm create, you are creating a copy-on-write(COW) based clone of your current root. We don't want that though.. We want an empty root directory to start with.. So we'll create it by hand.. We'll need a uuid. You can write a little program using libuuid(3LIB) or just make one up. Also, we'll put the mountdir to a temporary location during the install.

opensolaris console login: myuser
Password:
Last login: Mon Aug 17 11:33:40 on console
Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008
myuser@opensolaris:~$ pfexec su -
Aug 17 11:39:04 opensolaris su: 'su root' succeeded for myuser on /dev/console
Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008
root@opensolaris:~# zfs create rpool/ROOT/small-be
root@opensolaris:~# zfs set canmount=noauto rpool/ROOT/small-be
root@opensolaris:~# zfs set mountpoint="/mnt" rpool/ROOT/small-be
root@opensolaris:~# zfs set org.opensolaris.libbe:uuid=f0fa607f-7d1c-66ca-caf9-e04cbf rpool/ROOT/small-be
root@opensolaris:~# zfs mount rpool/ROOT/small-be 
Next, we'll setup a new packaging environment in the new "BE", install a custom set of packages, seed SMF, setup vfstab and /dev, cleanup some OpenSolaris cruft, setup the new BE to prompt for configuration on the next boot, and then apply the workaround for the zfs bug mentioned above.
root@opensolaris:~# export ROOTDIR=/mnt
root@opensolaris:~# pkg image-create -f -F -a opensolaris.org=http://pkg.opensolaris.org/ $ROOTDIR
root@opensolaris:~# export PKGS="entire \\
        SUNWcsd \\
        SUNWcs \\
        SUNWcarx \\
        SUNWcakrx \\
        SUNWos86r \\
        SUNWkvm \\
        SUNWrmodr \\
        SUNWpsdcr \\
        SUNWpsdir \\
        SUNWcnetr \\
        SUNWesu \\
        SUNWkey \\
        SUNWuprl \\
        SUNWkrb \\
        SUNWbip \\
        SUNWzfskr \\
        SUNWbash \\
        SUNWipf \\
        SUNWbash \\
        SUNWgrub \\
        SUNWtoo \\
        SUNWbind \\
        SUNWrcmdc \\
        SUNWmkcd \\
        SUNWPython \\
        SUNWPython-extra \\
        SUNWipkg \\
        SUNWinstall \\
        SUNWbeadm \\
        SUNWadmap \\
        SUNWadmlib-sysid \\
        SUNWadmr"
root@opensolaris:~# pkg -R $ROOTDIR install $PKGS
DOWNLOAD                                    PKGS       FILES     XFER (MB)
Completed                                  67/67   8542/8542   93.32/93.32

PHASE                                        ACTIONS
Install Phase                            14726/14726
PHASE                                          ITEMS
Reading Existing Index                           8/8
Indexing Packages                              67/67
Optimizing Index...
PHASE                                          ITEMS
Indexing Packages                              67/67
root@opensolaris:~# rm -rf $ROOTDIR/var/pkg/download/\*
root@opensolaris:~# /usr/bin/cp $ROOTDIR/lib/svc/seed/global.db $ROOTDIR/etc/svc/repository.db
root@opensolaris:~# chmod 600 $ROOTDIR/etc/svc/repository.db
root@opensolaris:~# cd $ROOTDIR/var/svc/profile/
root@opensolaris:/mnt/var/svc/profile# ln -s generic_limited_net.xml generic.xml
root@opensolaris:/mnt/var/svc/profile# ln -s ns_files.xml name_service.xml
root@opensolaris:/mnt/var/svc/profile# cd
root@opensolaris:~# cp /etc/vfstab $ROOTDIR/etc/vfstab
root@opensolaris:~# /usr/sbin/devfsadm -R $ROOTDIR
root@opensolaris:~# echo -e "/lib/svc/method/sshd\\n\\
/usr/sbin/sysidkbd\\n\\
/usr/sbin/sysidpm\\n\\
/lib/svc/method/net-nwam\\n\\
/usr/lib/cc-ccr/bin/eraseCCRRepository" > $ROOTDIR/etc/.sysidconfig.apps
root@opensolaris:~# /usr/sbin/sys-unconfig -R $ROOTDIR
sys-unconfig started Mon Aug 17 12:32:56 2009
rm: cannot remove `/mnt/etc/vfstab.sys-u': No such file or directory
grep: /mnt/etc/dumpadm.conf: No such file or directory
sys-unconfig completed Mon Aug 17 12:32:56 2009
root@opensolaris:~# cat $ROOTDIR/etc/passwd | sed '/\^jack/d' > $ROOTDIR/etc/passwd.new;mv -f $ROOTDIR/etc/passwd.new $ROOTDIR/etc/passwd
root@opensolaris:~# cat $ROOTDIR/etc/shadow | sed '/\^jack/d' > $ROOTDIR/etc/shadow.new;mv -f $ROOTDIR/etc/shadow.new $ROOTDIR/etc/shadow
root@opensolaris:~# cat $ROOTDIR/etc/user_attr | sed 's/\^root::::type=role;/root::::/g' > $ROOTDIR/etc/user_attr.new;mv -f $ROOTDIR/etc/user_attr.new $ROOTDIR/etc/user_attr
root@opensolaris:~# echo -e "\\n"\\
"forceload: drv/domcaps\\n"\\
"forceload: drv/xencons\\n"\\
"forceload: drv/xenbus\\n"\\
"forceload: drv/balloon\\n"\\
"forceload: drv/evtchn\\n"\\
"forceload: drv/privcmd\\n"\\
"forceload: drv/xdf\\n"\\
"forceload: drv/xnf\\n\\n" >> $ROOTDIR/etc/system
root@opensolaris:~# /usr/sbin/bootadm update-archive -R $ROOTDIR
updating /mnt//platform/i86pc/boot_archive
updating /mnt//platform/i86pc/amd64/boot_archive
Now lets unmount our new BE, and setup the correct mountpoint.
root@opensolaris:~# zfs umount rpool/ROOT/small-be
root@opensolaris:~# zfs set mountpoint="/" rpool/ROOT/small-be
root@opensolaris:~# beadm list
BE          Active Mountpoint Space   Policy Created
--          ------ ---------- -----   ------ -------
opensolaris -      -          4.01M   static 2009-08-17 10:05
small-be    -      -          357.76M static 2009-08-17 11:49
snv111b     NR     /          3.16G   static 2009-08-17 11:39
root@opensolaris:~# 
Time to switch to our new BE and run through the configure.
root@opensolaris:~# beadm activate small-be
root@opensolaris:~# beadm list
BE          Active Mountpoint Space   Policy Created
--          ------ ---------- -----   ------ -------
opensolaris -      -          4.01M   static 2009-08-17 10:05
small-be    R      -          357.76M static 2009-08-17 11:49
snv111b     N      /          3.16G   static 2009-08-17 11:39
root@opensolaris:~# reboot
Aug 17 12:38:24 opensolaris reboot: initiated by myuser on /dev/console
syncing file systems... done
rebooting...
v3.3.2-xvm chgset 'Wed Aug 12 17:12:49 2009 -0700 18433:bd9f134b1e1b'
SunOS Release 5.11 Version snv_111b 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: unknown
Configuring devices.
Loading smf(5) service descriptions: 78/78
Reading ZFS config: done.
Mounting ZFS filesystems: (8/8)


What type of terminal are you using?
 1) ANSI Standard CRT
[CUT]
Configuring network interface addresses: xnf0.
System identification is completed.

unknown console login: root
Password:
Aug 17 09:45:08 unknown login: ROOT LOGIN /dev/console
Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008
root@unknown:~#
We went from around 3G to around 370M for disk footprint using the custom set of packages above. Not too bad, but we can improve this over time.
Filesystem            kbytes    used   avail capacity  Mounted on
rpool/ROOT/small-be  20514816  373259 15924049     3%    /
rpool/ROOT/snv111b   20514816 3005343 15920934    16%    /mnt
Now lets create a COW based clone of out new BE and switch to it.

root@unknown:~# beadm list
BE          Active Mountpoint Space   Policy Created
--          ------ ---------- -----   ------ -------
opensolaris -      -          4.01M   static 2009-08-17 07:05
small-be    NR     /          364.43M static 2009-08-17 08:49
snv111b     -      -          3.17G   static 2009-08-17 08:39
root@unknown:~# bootadm update-archive
updating //platform/i86pc/boot_archive
updating //platform/i86pc/amd64/boot_archive
root@unknown:~# beadm create small-be-clone
root@unknown:~# beadm activate small-be-clone
root@unknown:~# beadm list
BE             Active Mountpoint Space   Policy Created
--             ------ ---------- -----   ------ -------
opensolaris    -      -          4.01M   static 2009-08-17 07:05
small-be       N      /          19.5K   static 2009-08-17 08:49
small-be-clone R      -          364.55M static 2009-08-17 09:56
snv111b        -      -          3.17G   static 2009-08-17 08:39
root@unknown:~#

root@unknown:~# reboot
Aug 17 09:57:20 unknown reboot: initiated by root on /dev/console
syncing file systems... done
rebooting...
v3.3.2-xvm chgset 'Wed Aug 12 17:12:49 2009 -0700 18433:bd9f134b1e1b'
SunOS Release 5.11 Version snv_111b 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: unknown
Reading ZFS config: done.
Mounting ZFS filesystems: (9/9)

unknown console login: root
Password:
Aug 17 13:46:34 unknown login: ROOT LOGIN /dev/console
Last login: Mon Aug 17 13:44:19 on console
Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008
root@unknown:~#
root@unknown:~# beadm list
BE             Active Mountpoint Space   Policy Created
--             ------ ---------- -----   ------ -------
opensolaris    -      -          4.01M   static 2009-08-17 10:05
small-be       -      -          3.78M   static 2009-08-17 11:49
small-be-clone NR     /          420.45M static 2009-08-17 12:56
snv111b        -      -          3.17G   static 2009-08-17 11:39 
From here, we are going to try two different things.. The first thing we want to try is to make sure we can delete the new BE and it's clone. We also want to get a little crazy and see if we can get rid of the original opensolaris and snv111b snapshot. But before we continue on, lets snapshot the vdisk so we can rollback to this point so we don't need to install from scratch again.
root@unknown:~# poweroff
Aug 17 13:47:20 unknown poweroff: initiated by root on /dev/console
syncing file systems... done
: core2[1]#; vdiskadm -u xvm snapshot /vdisks/opensolaris@pre-destroy 
: core2[1]#; virsh start opensolaris;virsh console opensolaris 
Lets remove the small-be-clone clone and the small-be BE. For small-be, since we created this by hand, we will want to remove it by hand.
opensolaris console login: root
Password:
Last login: Mon Aug 17 11:46:42 on console
Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008
root@unknown:~# beadm activate snv111b
root@unknown:~# reboot
Aug 17 09:01:18 unknown reboot: initiated by root on /dev/console
syncing file systems... done
rebooting...
v3.3.2-xvm chgset 'Wed Aug 12 17:12:49 2009 -0700 18433:bd9f134b1e1b'
SunOS Release 5.11 Version snv_111b 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: opensolaris
Reading ZFS config: done.
Mounting ZFS filesystems: (9/9) 

opensolaris console login: myuser
Password:
Last login: Mon Aug 17 11:46:42 on console
Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008 
myuser@opensolaris:~$ pfexec su -
Aug 17 13:10:19 opensolaris su: 'su root' succeeded for myuser on /dev/console
Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008
root@opensolaris:~# beadm destroy small-be-clone
Are you sure you want to destroy small-be-clone? This action cannot be undone(y/[n]): y
root@opensolaris:~# zfs destroy rpool/ROOT/small-be
root@opensolaris:~# beadm list
BE          Active Mountpoint Space Policy Created
--          ------ ---------- ----- ------ -------
opensolaris -      -          4.01M static 2009-08-17 10:05
snv111b     NR     /          3.18G static 2009-08-17 11:39
root@opensolaris:~#
Now, lets rollback the vdisk and try the second part of our test.
root@unknown:~# poweroff
Aug 17 15:14:27 unknown poweroff: initiated by root on /dev/console
syncing file systems... done
: core2[1]#; vdiskadm -u xvm rollback /vdisks/opensolaris@pre-destroy
: core2[1]#; virsh start opensolaris;virsh console opensolaris 
Domain opensolaris started

v3.3.2-xvm chgset 'Wed Aug 12 17:12:49 2009 -0700 18433:bd9f134b1e1b'
SunOS Release 5.11 Version snv_111b 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: unknown
Reading ZFS config: done.
Mounting ZFS filesystems: (9/9)

unknown console login: root
Password: 
Aug 17 15:22:00 unknown login: ROOT LOGIN /dev/console
Last login: Mon Aug 17 13:46:34 on console
Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008
root@unknown:~# beadm list
BE             Active Mountpoint Space   Policy Created          
--             ------ ---------- -----   ------ -------          
opensolaris    -      -          4.62M   static 2009-08-17 10:05 
small-be       -      -          1.72M   static 2009-08-17 13:23 
small-be-clone NR     /          377.48M static 2009-08-17 13:45 
snv111b        -      -          3.17G   static 2009-08-17 13:20 
root@unknown:~# 
root@unknown:~# beadm destroy snv111b
Are you sure you want to destroy snv111b? This action cannot be undone(y/[n]): y
root@unknown:~# beadm destroy opensolaris
Are you sure you want to destroy opensolaris? This action cannot be undone(y/[n]): y
root@unknown:~# beadm list
BE             Active Mountpoint Space   Policy Created
--             ------ ---------- -----   ------ -------
small-be       -      -          1.72M   static 2009-08-17 13:23
small-be-clone NR     /          377.48M static 2009-08-17 13:45
root@unknown:~# beadm activate small-be
root@unknown:~#
root@unknown:~# reboot
Aug 17 13:52:22 unknown reboot: initiated by root on /dev/console
syncing file systems... done
rebooting...
v3.3.2-xvm chgset 'Wed Aug 12 17:12:49 2009 -0700 18433:bd9f134b1e1b'
SunOS Release 5.11 Version snv_111b 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: unknown
Reading ZFS config: done.
Mounting ZFS filesystems: (7/7)

unknown console login: root
Password:
Aug 17 13:53:22 unknown login: ROOT LOGIN /dev/console
Last login: Mon Aug 17 13:44:19 on console
Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008
root@unknown:~# beadm destroy small-be-clone
Are you sure you want to destroy small-be-clone? This action cannot be undone(y/[n]): y
root@unknown:~#
Finally, lets create a new cloned BE and upgrade it to snv118. Here we'll run into some minor pkg bugs.. They don't hurt anything though. You see that we grew the root a little after our upgrade.
root@unknown:~# beadm create snv118
root@unknown:~# beadm mount snv118 /mnt
root@unknown:~# pkg -R /mnt set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
root@unknown:~# pkg -R /mnt install SUNWipkg
No updates available for this image.
root@unknown:~# pkg -R /mnt install entire@0.5.11-0.118
DOWNLOAD                                    PKGS       FILES     XFER (MB)
Completed                                  73/73   4311/4311   79.34/79.34

PHASE                                        ACTIONS
Removal Phase                              1569/1569
Install Phase                              2619/2619
Update Phase                               5470/5698 
driver (softmac) upgrade (removal of policy'read_priv_set=net_rawaccess write_priv_set=net_rawaccess) failed: minor node spec required.
driver (vnic) upgrade (removal of policy'read_priv_set=net_rawaccess write_priv_set=net_rawaccess) failed: minor node spec required.
driver (aggr) upgrade (removal of policy'read_priv_set=net_rawaccess write_priv_set=net_rawaccess) failed: minor node spec required.
Update Phase                               5610/5698 
driver (dnet) upgrade (removal of policy'read_priv_set=net_rawaccess write_priv_set=net_rawaccess) failed: minor node spec required.
driver (elxl) upgrade (removal of policy'read_priv_set=net_rawaccess write_priv_set=net_rawaccess) failed: minor node spec required.
driver (iprb) upgrade (removal of policy'read_priv_set=net_rawaccess write_priv_set=net_rawaccess) failed: minor node spec required.
Update Phase                               5698/5698
PHASE                                          ITEMS
Reading Existing Index                           8/8
Indexing Packages                              73/73
Optimizing Index...
PHASE                                          ITEMS
Indexing Packages                              73/73
root@unknown:~# bootadm update-archive -R /mnt
updating /mnt//platform/i86pc/boot_archive
updating /mnt//platform/i86pc/amd64/boot_archive
root@unknown:~# beadm umount snv118
root@unknown:~# beadm activate snv118
root@unknown:~# reboot
Aug 17 14:05:37 unknown reboot: initiated by root on /dev/console
syncing file systems... done
rebooting...
v3.3.2-xvm chgset 'Wed Aug 12 17:12:49 2009 -0700 18433:bd9f134b1e1b'
SunOS Release 5.11 Version snv_118 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: unknown
Configuring devices.
Loading smf(5) service descriptions: 6/6
Reading ZFS config: done.
Mounting ZFS filesystems: (7/7)

unknown console login: root
Password:
Aug 17 14:06:39 unknown login: ROOT LOGIN /dev/console
Last login: Mon Aug 17 13:53:22 on console
Sun Microsystems Inc.   SunOS 5.11      snv_118 November 2008
root@unknown:~# rm -rf /var/pkg/download/\*
root@unknown:~# df -lk
Filesystem            kbytes    used   avail capacity  Mounted on
rpool/ROOT/snv118    20514816  418664 18756788     3%    /

Thursday Mar 12, 2009

How small can Solaris go

I've been playing around lately to see how small of a Solaris image I can get that will boot to a shell prompt... It turns out, it can get pretty small.. And it boots to the prompt in ~ 1 second... Although there's not a lot you can do with it :-)
SunOS Release 5.11 Version onnv-3.3-mrj 32-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
WARNING: Last shutdown is later than time on time-of-day chip; check date.
strplumb: failed to initialize drv/dld
# df -lk
Filesystem            kbytes    used   avail capacity  Mounted on
/ramdisk:a             38255   15146   19284    44%    /
/devices                   0       0       0     0%    /devices
/dev                       0       0       0     0%    /dev
ctfs                       0       0       0     0%    /system/contract
proc                       0       0       0     0%    /proc
mnttab                     0       0       0     0%    /etc/mnttab
swap                 1768156       0 1768156     0%    /etc/svc/volatile
objfs                      0       0       0     0%    /system/object
sharefs                    0       0       0     0%    /etc/dfs/sharetab
# du -sk \*
31      boot
952     dev
1       devices
67      etc
6087    kernel
4849    lib
8       lost+found
1931    platform
11277   proc
122     sbin
1441    system
1       tmp
59      usr
3       var
# 

UPDATE: Developing with multiple BEs in OpenSolaris

Following up on Bart's comment, you can certainly use -R to perform the same operation... Not sure why I didn't think that could be used for pkg set-authority, but it can..

Funny since I use -R for my custom opensolaris builds..

root@unknown:~# df -lk
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0d0s0       491522  367690   74680    84%    /
/devices                   0       0       0     0%    /devices
/dev                       0       0       0     0%    /dev
ctfs                       0       0       0     0%    /system/contract
proc                       0       0       0     0%    /proc
mnttab                     0       0       0     0%    /etc/mnttab
swap                  767856     336  767520     1%    /etc/svc/volatile
objfs                      0       0       0     0%    /system/object
sharefs                    0       0       0     0%    /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap3.so.1
                      491522  367690   74680    84%    /lib/libc.so.1
fd                         0       0       0     0%    /dev/fd
swap                  767520       0  767520     0%    /tmp
swap                  767536      16  767520     1%    /var/run
root@unknown:~# pkg list | wc -l
      65
root@unknown:~# ps -ef
     UID   PID  PPID   C    STIME TTY         TIME CMD
    root     0     0   0 06:52:45 ?           0:01 sched
    root     1     0   0 06:52:46 ?           0:00 /sbin/init
    root     2     0   0 06:52:46 ?           0:00 pageout
    root     3     0   0 06:52:46 ?           0:00 fsflush
    root     7     1   0 06:52:47 ?           0:02 /lib/svc/bin/svc.startd
    root     9     1   0 06:52:47 ?           0:27 /lib/svc/bin/svc.configd
    root   549     1   0 10:58:06 ?           0:00 /usr/lib/inet/inetd start
    root   201     1   0 06:53:19 ?           0:00 devfsadmd
  daemon   293     1   0 06:53:40 ?           0:00 /lib/crypto/kcfd
   dladm    15     1   0 06:52:48 ?           0:00 /sbin/dlmgmtd
    root   298     1   0 06:53:44 ?           0:00 /usr/lib/picl/picld
    root   198     1   0 06:53:19 ?           0:00 /usr/lib/sysevent/syseventd
    root   554   552   0 10:58:06 ?           0:00 /usr/lib/saf/ttymon
    root   552     7   0 10:58:06 ?           0:00 /usr/lib/saf/sac -t 300
    root   580     1   0 10:58:08 ?           0:00 /usr/lib/ssh/sshd
  daemon   525     1   0 06:57:48 ?           0:00 /usr/sbin/rpcbind
    root   544     1   0 10:58:03 ?           0:00 /usr/sbin/nscd
    root   512     1   0 06:56:07 ?           0:00 /sbin/dhcpagent
    root   553     1   0 10:58:06 ?           0:00 /usr/lib/utmpd
    root   571     7   0 10:58:08 console     0:00 -bash
    root   376     1   0 06:53:54 ?           0:00 /usr/sbin/cron
    root   667   571   0 11:03:32 console     0:00 ps -ef
    root   567     1   0 10:58:08 ?           0:00 /usr/sbin/syslogd
root@unknown:~#
Anyway, here is the sequence using a -R.. This works for 99.x% of the cases.. But I would expect to fail for the same cases lu will.. i.e. say you need a new version update_drv, etc. For those cases the chroot will get you through it with some skilled sequencing.. Of course, the chroot approach can have it's own set of problems :-)
beadm create snv109
beadm mount snv109 /mnt
pkg -R /mnt set-authority -O http://pkg.opensolaris.org/dev opensolaris.org
pkg -R /mnt refresh
pkg -R /mnt install SUNWipkg
pkg -R /mnt install entire@0.5.11-0.109
bootadm update-archive -R /mnt
beadm umount snv109
beadm activate snv109

Wednesday Mar 04, 2009

Developing with multiple BEs in OpenSolaris

One of the things I have found over the years to be helpful as a Solaris kernel developer is liveupgrade (lucreate(1M)).

For example, I will usually have 3 boot environments (BE) on my systems at one time. The first BE usually has the base build that my gate is a child of. The second BE is my test BE which I BFU, and generally seem to kill quite often. And the third BE is an old build so I can go move my test BE to the old build all the way up to the most recent build (luupgrade(1M)).

OpenSolaris uses beadm which is a nice replacement to the lu commands. I was a little disappointed when I first started using it because I didn't think it was as flexible as I wanted.. But after playing with it for a while, I have a setup which I'm pretty happy with at the moment.

What I want to be able to do is to have the stock OpenSolaris BE, which gets upgraded at major releases. But I also want to have BEs which are based off of various development builds, and test BEs which I can BFU, etc. This is really nice since these will be COW based clones due to zfs.

So the question was how to update an alternate BE and at the same time, not modify the current BE. i.e. not having to update the authority, upgrade SUNWipkg, etc. The solution is pretty simple, chroot...

Here's how I create a b108 BE on a stock 2008.11 system, while ensuring I keep the 2008.11 bits unmodified. I can switch back to the stock OpenSolaris bits at any time with a beadm activate opensolaris.

beadm create snv108
beadm mount snv108 /mnt
mount -F proc /proc /mnt/proc
chroot /mnt
pkg set-authority -O http://pkg.opensolaris.org/dev opensolaris.org
pkg refresh
pkg install SUNWipkg
pkg install entire@0.5.11-0.108
bootadm update-archive
exit
umount /mnt/proc
beadm umount snv108
beadm activate snv108

Monday May 05, 2008

Installing OpenSolaris on Xen

Here are some quick instructions on how to install a DHCP based PV OpenSolaris guest/domU on a hypervisor based on the Xen open source community.

First, download the OpenSolaris CDROM.

Here's the py file I'm using... Your path to pygrub will differ if your using a linux dom0.

: alpha[1]#; cat pv.py
name = "opensolaris-pv-install"
vcpus = 1
memory = "1024"
bootloader = "/usr/lib/xen/bin/pygrub"
kernel = "/platform/i86xpv/kernel/amd64/unix"
ramdisk = "/boot/x86.microroot"
extra = "/platform/i86xpv/kernel/amd64/unix -B console=ttya,livemode=text"
disk = ['file:/tank/guests/install/opensolaris/os200805.iso,6:cdrom,r',
        'file:/tank/guests/opensolaris/disk.img,0,w']
vif = ['']
on_shutdown = "destroy"
on_reboot = "destroy"
on_crash = "preserve"
: alpha[1]#; 

Setup your paths correctly, create your disk, etc. Boot the OpenSolaris LiveCD

: alpha[1]#; xm create -c pv.py
Using config file "./pv.py".
Started domain opensolaris-pv-install
v3.1.4-xvm chgset 'Fri May 02 10:23:19 2008 -0700 15873:3e3bd3d19023'
SunOS Release 5.11 Version snv_86 64-bit
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: opensolaris
Remounting root read/write
Probing for device nodes ...
Preparing live image for use
Done mounting Live image
USB keyboard
 1. Albanian                      22. Latvian                       
 2. Belarusian                    23. Macedonian                    
 3. Belgian                       24. Malta_UK                      
 4. Bulgarian                     25. Malta_US                      
 5. Croatian                      26. Norwegian                     
 6. Czech                         27. Polish                        
 7. Danish                        28. Portuguese                    
 8. Dutch                         29. Russian                       
 9. Finnish                       30. Serbia-And-Montenegro         
10. French                        31. Slovenian                     
11. French-Canadian               32. Slovakian                     
12. Hungarian                     33. Spanish                       
13. German                        34. Swedish                       
14. Greek                         35. Swiss-French                  
15. Icelandic                     36. Swiss-German                  
16. Italian                       37. Traditional-Chinese           
17. Japanese-type6                38. TurkishQ                      
18. Japanese                      39. TurkishF                      
19. Korean                        40. UK-English                    
20. Latin-American                41. US-English                    
21. Lithuanian                    
To select the keyboard layout, enter a number [default 41]:

1. Chinese - Simplified
2. Chinese - Traditional
3. English
4. French
5. German
6. Italian
7. Japanese
8. Korean
9. Portuguese - Brazil
10. Russian
11. Spanish
12. Swedish
To select the desktop language, enter a number [default 3]:
Configuring devices.
Mounting local partitions/cdroms
Reading ZFS config: done.

opensolaris console login: 
 May  5 08:06:30 opensolaris in.routed[639]: route 0.0.0.0/8 --> 0.0.0.0 nexthop ...
opensolaris console login: 

Log into LiveCD (jack/jack). Make sure your networking is up (it can take a minute or two until the DHCP client runs).

opensolaris console login: jack
Password: 
Last login: Mon May  5 08:07:00 on console
Sun Microsystems Inc.   SunOS 5.11      snv_86  January 2008
jack@opensolaris:~$ 
jack@opensolaris:~$ ifconfig xnf0
xnf0: flags=201004843 mtu 1500 index 2
        inet 192.168.0.117 netmask ffffff00 broadcast 192.168.0.255

Start up a VNC Server, connect to it on port 5901, and run through the install.

jack@opensolaris:~$ mkdir .vnc;cp .Xclients .vnc/xstartup 
jack@opensolaris:~$ vncserver

You will require a password to access your desktops.

Password:
Verify:
xauth:  creating new authority file /jack/.Xauthority

New 'opensolaris:1 ()' desktop is opensolaris:1

Starting applications specified in /jack/.vnc/xstartup
Log file is /jack/.vnc/opensolaris:1.log

Once your install has completed, create a py file for the your new guest, and you are ready to go...

name = "opensolaris"
vcpus = 1
memory = "512"
disk = ['file:/tank/guests/opensolaris/disk.img,0,w']
vif = ['']
on_shutdown = "destroy"
on_reboot = "restart"
on_crash = "destroy"

Have Fun!

Thursday Jul 19, 2007

The Latest Solaris on Xen drop

I've learned something.. Never promise to write a follow up blog. :-) In keeping with my usual raw data dump style, here are some tips and tricks...

The latest Solaris on Xen drop is out. If you haven't seen it already, head over to here.

If you don't have a vncviewer installed, you can use the one which is part of vino (remote desktop). Here's what I have on some of my systems.
$ cat `which vncviewer`
#!/bin/sh
exec java -jar /usr/share/gnome/vino/vino-client.jar  ${1+"$@"} 
When running windows XP as a guest, you can enable (crappy) sound by adding the following to your py file.
  soundhw='es1370'
The default NIC in HVM works fine, but is slowww. At this time, WinXP PV drivers aren't generally available (to speed the IO up).
  vif = [ 'type=ioemu,mac=.your mac here.' ]
Also in WinXP, if your using VNC, specifying a USB tablet is helpful for mouse tracking. This doesn't work in Linux HVM domains though (no driver available). e.g. add ...
  usb=1
  usbdevice="tablet"
Other than the NIC, these settings work equally well in Windows Vista. You need to use a different NIC to get networking to work in Vista, e.g.
  vif =  ['type=ioemu,mac=.your mac here.,model=ne2k_pci']
I have few guests on my system.

I need more memory on my system now. 4G isn't enough anymore. It takes a while to go through all my guests and install the updates weekly.

A couple of the more interesting, non traditional ones... for Ubunutu, I did a HVM install and then copied in a Linux domU kernel which I built (all static, no modules). One thing I haven't figured out yet is how to properly setup the console devices. I can't seem to get /dev/xvc0 or /dev/tty0 created even though the frontend and backend driver are present. I do get console output on xvc0 though, so it is usuable. For the PV framebuffer, I do get a connection between the frontend/backend, but can't run X since it wants to open /dev/tty0. You do see a lot of "/dev/mem: mmap: Bad address" messages which comes from /usr/sbin/dmidecode trying to call into the BIOS (which isn't support in a PV guest). Unfortunately it's not a trival thing to remove...
    root@mrj-desktop:/etc# apt-get remove dmidecode
    Reading package lists... Done
    Building dependency tree       
    Reading state information... Done
    The following packages will be REMOVED:
      acpi-support dmidecode gnome-power-manager gnome-session hotkey-setup
      laptop-detect powermanagement-interface powernowd tasksel tasksel-data
      ubuntu-desktop ubuntu-minimal ubuntu-standard
    0 upgraded, 0 newly installed, 13 to remove and 0 not upgraded.
    27 not fully installed or removed.
    Need to get 0B of archives.
    After unpacking 17.2MB disk space will be freed.
    Do you want to continue [Y/n]? n
    Abort.
    root@mrj-desktop:/etc# 
It hasn't stopped me from doing anything other than upgrading hotkey-setup.

My gentoo guest is a staging area for building my via based router image. I have a compact flash in the system with a kernel and compressed ramdisk. The ramdisk is < 256M when uncompressed. This way I don't write to the compact flash very often, and if the system is hacked, I just need to power cycle to clean it up. I'm running a real gentoo/glibc (vs ulibc) distro which is why it's so big. I have a disk image which I loop mount and chroot to. I update all the packages then umount, copy the image file to a final image file. loop mount and chroot to new image file. emerge -C the man pages and gcc, rm -rf /usr/portage, portage files in /var, and some extra docs, then gzip it and scp it to the router. I have to install more memory this way but can run any package I want which is nice..

It's nice to be able to do all this from one system now :-) It's surprising how often I fire up the other OSes to try something out or look around, etc. I just need a laptop with 8G or more of memory now... :-)

Wednesday Nov 16, 2005

New x86 rootnex code and dtrace

In snv_24, there was some significant changes to the DDI DMA routines in the x86 rootnex.

For driver writers, there's some additional visibility to what's going on with the DDI DMA interfaces via dtrace. The following is a hacked up dtrace script to get an idea of what you can see. I need to clean it up and explain what your seeing more, but for now, I'll just put it out there.

#!/usr/sbin/dtrace -Fs

sdt:rootnex:\*:rootnex-bind-prealloc
{
	@prealloc_cookies[arg0] = count();
}

sdt:rootnex:\*:rootnex-bind-alloc
{
	@ba[arg0] = count();
}

sdt:rootnex:\*:rootnex-alloc-handle
{
	@lq[probefunc] = lquantize(arg0, 0, 16384, 128);
}

sdt:rootnex:\*:rootnex-bind-fast
{
	@lq[probefunc] = lquantize(arg1, 0, 16384, 128);
	@bf[arg0] = quantize(arg2);
}

sdt:rootnex:\*:rootnex-bind-slow
{
	@lq[probefunc] = lquantize(arg1, 0, 16384, 128);
	@bs[arg0] = quantize(arg2);
}

sdt:rootnex:\*:rootnex-bind-sp-alloc
{
	@bw[arg0] = count();
}

sdt:rootnex:\*:rootnex-sync-dev
{
	@sd[arg0] = sum(arg1);
}

sdt:rootnex:\*:rootnex-sync-cpu
{
	@sc[arg0] = sum(arg1);
}

sdt:rootnex:\*:rootnex-alloc-copybuf
{
	@copybuf_alloc[arg0] = sum(arg1);
}

sdt:rootnex:\*:rootnex-sgllen-window
{
	@sgllen_window[arg0] = count();
}

sdt:rootnex:\*:rootnex-copybuf-window
{
	@copybuf_window[arg0] = count();
}

sdt:rootnex:\*:rootnex-maxxfer-window
{
	@maxxfer_window[arg0] = count();
}

fbt:unix:i_ddi_mem_alloc:entry
{
	@mem_alloc[arg0] = sum(arg2);
	@c[probefunc] = count();
}

fbt:genunix:ddi_dma_alloc_handle:entry
{
	@c[probefunc] = count();
}

fbt:genunix:ddi_dma_addr_bind_handle:entry
{
	@c[probefunc] = count();
}

fbt:genunix:ddi_dma_buf_bind_handle:entry
{
	@c[probefunc] = count();
}

fbt:genunix:ddi_dma_mem_alloc:entry
{
	@c[probefunc] = count();
}

END
{
	printf("\\n\\n");

	printf("\\nDMA Function Count\\n");
	printf("    %-26s\\tCNT\\n", "FUNCTION");
	printa("    %-26s\\t%@u\\n", @c);

	printf("\\nalloc: bytes allocated in dma alloc\\n");
	printf("    %-18s\\tCNT\\n", "DIP");
	printa("    0x%p\\t%@u\\n", @mem_alloc);

	printf("\\nbind: used pre-alloced sgl storage (fast)\\n");
	printf("    %-18s\\tCNT\\n", "DIP");
	printa("    0x%p\\t%@u\\n", @prealloc_cookies);

	printf("\\nbind: had to alloc memory to store sgl (slow)\\n");
	printf("    %-18s\\tCNT\\n", "DIP");
	printa("    0x%p\\t%@u\\n", @ba);

	printf("\\nbind: had to alloc memory for window/copybuf state (slow)\\n");	
	printf("    %-18s\\tCNT\\n", "DIP");
	printa("    0x%p\\t%@u\\n", @bw);

	printf("\\nbind: bytes allocated for copybuf\\n");
	printf("    %-18s\\tCNT\\n", "DIP");
	printa("    0x%p\\t%@u\\n", @copybuf_alloc);

	printf("\\nbind: sgllen windows\\n");
	printf("    %-18s\\tCNT\\n", "DIP");
	printa("    0x%p\\t%@u\\n", @sgllen_window);

	printf("\\nbind: copybuf windows\\n");
	printf("    %-18s\\tCNT\\n", "DIP");
	printa("    0x%p\\t%@u\\n", @copybuf_window);

	printf("\\nbind: maxxfer windows\\n");
	printf("    %-18s\\tCNT\\n", "DIP");
	printa("    0x%p\\t%@u\\n", @maxxfer_window);

	printf("\\nbind: took fastpath\\n");
	printa("    DIP = 0x%p\\tvalue=bindsize%@u\\n", @bf);

	printf("\\nbind: took slowpath\\n");
	printa("    DIP = 0x%p\\tvalue=bindsize%@u\\n", @bs);

	printf("\\nalloc/bind: outstanding handles and binds\\n");
	printa("    %s\\t%@u\\n", @lq);

	printf("\\nsync: bytes copied for sync dev\\n");
	printf("    %-18s\\tCNT\\n", "DIP");
	printa("    0x%p\\t%@u\\n", @sd);

	printf("\\nsync: bytes copied for sync cpu\\n");
	printf("    %-18s\\tCNT\\n", "DIP");
	printa("    0x%p\\t%@u\\n", @sc);
}

Monday Aug 15, 2005

Solaris x86 rootnex warning in syslog

Your probably here because you searched for "x86 rootnex warning" or something like that on google. :-) If so, you came to the right place. I'm including below, a slightly edited heads up message that I sent out when I putback the bug fix related to this warning. It gives some details on the warning.

But first, a little background. Towards the end of s10 development, I fixed a couple of old x86 rootnex bugs. One of these incorrectly passed a dma bind operation instead of failing it. This fix ended up finding a few driver bugs. Since we were getting relatively close to s10 RR, and this was a hard to diagnose problem, I ended up printing a warning when we hit this condition. We found a few 3rd party drivers with sgllen related bugs in Solaris Express after this was fixed.

I heard some folks still occasionally run into this, so I figured I'd put this info out there... Here's the rootnex code in question and here's the original bug. Below is a slightly edited version of the heads-up that I sent out..


My recent putback of:

(P1) 3001685 ddi: DMA breakup routines do not match DDI specification
     4926500 ddi: x86 DMA breakup with sgllen==1 can give ncookies!=1
     4796610 ddivs_dmae/ddi_check_dma_handle_\* 3 assertions fail due to a product bug

could cause existing buggy drivers which are thought to be functioning
correctly, to start failing.


------------------


 o description of the problem - what's wrong

   The x86 root nexus's ddi_dma_\*_bind_handle() implementation
   can wrongly return more cookies than a driver specifies is
   the maximum that it can handle (when DDI_DMA_PARTIAL is not
   specified). This can be a very serious problem, causing silent
   data corruption. There can be a lot of corner cases depending
   on memory fragmentation, making it difficult to test for.
   Unfortunately, the fix for these bugs also can/has exposed
   minor bugs in existing drivers, which could be hard to identify
   and debug. An example is the elxl driver which specified that
   it could only handle 1 cookie for the tx data buffers, but
   really could handle 2 (and was getting 2 occasionally).


 o does the problem happen on sparc?  why not?

   This problem does \*not\* happen on SPARC. This is a bug
   in the x86 rootnex driver. This part of the code is not
   shared with any SPARC nexus driver code.


 o what rootnex was doing wrong

   The rootnex driver would return success from a DMA bind
   operation when the cookie count was larger than the maximum
   the driver specified that it could handle, and the driver
   specified that it couldn't handle partial mappings.


 o what rootnex is now doing right

   The rootnex driver fails a DMA bind operation when
   the cookie count would be larger than the maximum
   that a driver specifies that it can handle, and the
   driver specifies that it cannot handle partial mappings.


 o what drivers were doing wrong

   The vast majority of drivers aren't doing anything wrong.
   There may be a few which are not handling the DMA bind
   operation correctly. For example, if a driver cannot
   handle partial DMAs, it should be able to handle the
   following # of cookies ((max possible bind size / page size) + 1).
   If it cannot handle that number of cookies, it must be
   able to expect and correctly handle a failed bind
   operation.


 o what they need to do right.

   If a driver cannot handle partial DMAs, it should be able
   to handle the following # of cookies
   ((max possible bind size / page size) + 1).
   If it cannot handle that number of cookies, it must be
   able to expect and correctly handle a failed bind
   operation.


 o does this mean a driver which was working ok before
   could now broken, because a driver may have been
   written to work around the rootnex bug?

   No. A driver didn't need to workaround the bug. It does
   mean that a driver which was working or was appearing to
   work before, could now be broken. This could be because the
   code path for the failing bind operation has never been
   tested or the driver never expected the bind operation
   to fail (i.e. the driver has a bug which they never
   debugged since it never failed before unless they noticed
   memory corruption).


 o instructions for using flags

       There are two patchables rootnex_bind_fail & rootnex_bind_warn. The
       following table explains the behavior of the new bind operation
       failure. The current default behavior is to fails the bind and
       print one warning message per major number.

 rootnex_bind_fail |  rootnex_bind_warn  | Results if sgllen < \*ccount
                   |                     | && !DDI_DMA_PARTIAL
 ---------------------------------------------------------------------
        0          |          0          | behaves like code today (bind succeeds, no warning message)
        0          |          1          | bind succeeds, print one warning/major#
        1          |          0          | fails the bind, no warning message
        1          |          1          | fails the bind, print one warning/major#

       To revert to the previous behavior, which is incorrectly returning
       success from these ddi_dma_\*_bind_handle() operations, put the
       following in /etc/system then reboot.
          set rootnex:rootnex_bind_fail = 0

       To disable the warning message, put the following in /etc/system
       then reboot.
          set rootnex:rootnex_bind_warn = 0

       Anyone fixing bugs found by the warning should make sure they test their
       fix \*without\* rootnex_bind_fail = 0 and set kmem_flags = 3f to ensure they
       clean up correctly after the failure (since it's likely that code path
       hasn't been tested). e.g. put the following in /etc/system
          set rootnex:rootnex_bind_fail = 1
          set kmem_flags = 0x3f


 o so you've got a driver's source code.  how can you
   inspect the code to check for possible errors
   and how can you fix them?

   First you must understand what is the maximum possible bind
   size that your driver will see. This may not be trivial.
   For example, the ata driver handles ~64K maximum buffers
   for normal operation, but newfs goes through /dev/rdsk
   which doesn't break buffers down to ~64K buffers initially.
   Once you understand maximum possible bind size, make sure
   sgllen is set to ((max possible bind size / page size) + 1)
   and that you actually handle that many cookies. Or make
   sure you handle the fail case correctly.


 o can you run your fixed driver on earlier solaris releases?

   Yes. A correctly written driver will run fine on both S10
   and earlier solaris releases. This bug fix only fixes a
   problem of not correctly identifying a failure case. If a
   driver doesn't generate a failure case, or correctly
   handles the failure case, there will not be any problems.


 o What can I do if my system doesn't boot anymore

   In the unlikely event your system doesn't boot anymore,
   you can recover the system by setting a defered breakpoint
   in rootnex_attach, clear the rootnex_bind_fail patchable, 
   then, once you've booted, add set rootnex:rootnex_bind_fail = 0
   to /etc/system

   An example is provide below...

     Boot args: 

     Type    b [file-name] [boot-flags]       to boot with options
     or      i                                to enter boot interpreter
     or                                       to boot with defaults

                       <<< timeout in 5 seconds >>>

     Select (b)oot or (i)nterpreter: b kmdb -d
     Loading kmdb...

     Welcome to kmdb
     kmdb: Unable to determine terminal type: assuming `vt100'
     [0]> ::bp rootnex`rootnex_attach
     [0]> :c
     SunOS Release 5.10 Version gate:2004-10-18 32-bit
     Copyright 1983-2004 Sun Microsystems, Inc.  All rights reserved.
     Use is subject to license terms.
     Loaded modules: [ ufs unix krtld genunix specfs ]
     kmdb: stop at rootnex`rootnex_attach
     kmdb: target stopped at:
     rootnex`rootnex_attach: pushl  %ebp
     [0]> rootnex`rootnex_bind_fail?W 0
     rootnex`rootnex_bind_fail:      0x1             =       0x0
     [0]> :c
     Hostname: ...
     [hostname] console login: root
     Password: 

     [CUT]
     bfu'ed from /ws/on10-gate/archives/i386/nightly-nd on 2004-10-18
     Sun Microsystems Inc.   SunOS 5.10      s10_68  December 2004
     # echo "set rootnex:rootnex_bind_fail = 0" >> /etc/system
     #

Tuesday Jun 14, 2005

Solaris x86, Device DMA, and the DDI

Solaris x86, Device DMA, and the DDI I'm going to start a monthly blog entry on a DDI subject of your choice... Now that OpenSolaris is available, I can get into some decent detail referencing kernel code when needed.

So I'm seeking a topic for July... Anyone interested, submit your DDI topic of interest in the comments and I'll pick one for July....

For June, I'm going to talk about Solaris x86, device DMA, and the DDI.. Mostly because that's what I have been spending some of my time on lately... I write code, not documents, so don't expect too much. I can't spell worth a damn... Grammar is poor. Use the wrong words a lot. I'll probably jump around a lot too :-). Hopefully, you'll still get something meaningful out of this ;-)

I'm assuming for this entry you already know a little bit about the DDI DMA interfaces in Solaris. If not, you can look at the following manpages for a little background...

  • ddi_dma_alloc_handle(9M)
  • ddi_dma_free_handle(9M)
  • ddi_dma_addr_bind_handle(9M)
  • ddi_dma_buf_bind_handle(9M)
  • ddi_dma_unbind_handle(9M)
  • ddi_dma_sync(9M)
  • ddi_dma_getwin(9M)
  • ddi_dma_nextcookie(9M)
  • ddi_dma_attr(9S)
  • ddi_dma_cookie(9S)

The implementation of these routines live in sunddi.c where sunddi.o resides in genunix (/kernel/amd64/genunix). You'll see when you look at this code, that most of these routines are just simple wrappers which will eventually end up in architecture specific code.

Jumping ahead a little, on x86, we end up in the rootnex driver. The rootnex driver is the x86 root nexus driver. Nexus drivers implement the busops interface in dev_ops. Basically, drivers are hierarchical where nexus drivers can have children which could be either other nexus drivers or leaf drivers. A leaf driver is the last driver in a branch (i.e. can't have any children). The root nexus driver is the root driver, similar to the root of a filesystem. Anyway, that's a subject of another entry. For now, just trust me that we end up in rootnex for x86 :-)

So a quick mapping of code is:

ddi_dma_nextcookie stays in genunix...

Now let me be the first to say this is pretty rough code... This should be changing soon, but for now, you have been warned... So now that you know where the code is, I'm going to jump back up to a higher level...

If your still interested, you probably already know the normal sequence is to alloc a dma handle, bind a buffer, get your cookies (physical addresses to DMA into), sync if reading from memory, do your DMA, etc...

When a dma handle is allocated, the rootnex driver will do some validation on the ddi_dma_attr and pre-allocate some state for you. Nothing very exciting... The fun stuff happens in the bind code which will be the topic for the rest of this entry.. Instead of walking through the code, I'll walk through the concepts... The code should be changing soon, so I don't want to spend a lot of time on code which may not be the same by the time you read this. But first some terminology I'll be using which doesn't always match up with other folks terminology. Sometimes I like to redefine things too :-).

  • Scatter/Gather List (SGL) - a list of physically contiguous buffers
  • Cookie - single physically contiguous buffer i.e. a SGL element.
  • SGL Length - The maximum number of cookies/SGL elements the DMA engine supports
  • Copy Buffer - bounce buffer/intermediate buffer. Used as a temporary buffer to DMA to/from when the DMA engine can't reach the physical address we are supposed to MDA into.

Jumping to the fun stuff, the first concept in the bind, is how the buffer is passed down to bind. It can be a kernel virtual address (KVA) w/ size, a linked list of physical pages (without a kernel address), or an array of physical pages (with a kernel virtual address [shadow I/O]). For each page in the buffer, the rootnex driver has to make sure that the dma engine can reach the physical address. There is a DMA engine low address limit and a high address limit passed in the ddi_dma_attr during ddi_dma_alloc_handle(9M) which the rootnex driver uses to do this.

For every page which can't be reached, the rootnex driver will use part of a copy buffer. For these pages, the device will DMA into the copy buffer, and not the actual buffer. The data will be copied to/from the copy buffer when the driver calls ddi_dma_sync(9F). So the driver better make sure they have syncs in the right place and have the direction correct! Continuing... The copy buffer has a fixed maximum size. Each bind will get its own copy buffer if needed. If the amount of copybuf required in a single bind is greater than the maximum size of a copy buffer, the bind will need to be a partial bind and will require multiple windows. This is a concept I'll talk about further down..

What happens when a linked list of physical pages w/o a KVA comes down you asked? Good question! Well, currently, the rootnex driver will allocate some KVA space (vmem_alloc) without physical memory to back it up and then maps it to the physical page on the fly during sync. Not pretty. This should be changing for the 64-bit kernel in the near future (homework: what is seg kpm). How come the DMA engine can reach the copy buffer, but can't reach the original DMA buffer you ask? You guys are good... Well, most DMA buffers originate from userland or from a kernel stack which has no idea what the constraints of the DMA engine are (and it shouldn't since there may be multiple DMA engines with different constraints). The copy buffer is allocated from the same underlying routines that ddi_dma_mem_alloc(9F) uses, which takes into account the DMA engines constraints. i.e. the copy buffer is allocated specifically for the DMA engine we are using...

The copybuf code path got, and is still getting, a lot of usage in s10 and above once we went to a 64-bit kernel on x86. The number of x86 machines with > 4G of memory has gone up tremendously since you can actually use the memory more efficiently these days. OK, maybe efficiently isn't the right word, but you get my point...;-) A lot of devices only support 32-bit DMA addresses, so they correctly set their DMA high address to 0xFFFF.FFFF. Any physical address above this will require a copy buffer on x86 (On SPARC, we have an IOMMU so it doesn't have this problem, but that's a different entry).

Jumping to the side for a sec... don't confuse 64-bit DMA addresses with a 64-bit card. You may have a 32-bit/33MHz PCI card which supports 64-bit address via dual address cycles (DAC), you may have a 64-bit/66Mhz PCI card which only supports 32-bit DMA addresses, or you could have a x8 PCI Express card which only supports 32-bit DMA addresses. The speed of the card and the number of bytes that can be transfered in a clock have nothing to do with the DMA address width. If a device only supports a 32-bit DMA address, it will not be able to reach memory above 4G and will require a copy buffer.

Jumping back. It gets more interesting from here. Memory organization on SMP Opteron systems is very similar to our SPARC systems. The memory controller is in the CPU chip (which could have multiple cores). So if I have a two chip Opteron based system, I have at least 2 memory controllers. Solaris is smart, and will allocate memory closest to the core you are running on. Going back to the 2 chip Opteron system. If the system has 16G of memory, and I am a process running on chip 2, when I allocate memory, it's physical address will be above 8G (0 - 8G is attached to chip 1). So all I/O on chip 2 will need a copy buffer for a DMA engine with a high address limit of 0xFFFF.FFFF. Lessons learned, if you want performance on this type of system, use a device which supports 64-bit DMA addresses. And make sure if your device supports 64-bit DMA addresses, the driver supports 64-bit DMA addresses!

OK, enough about copy buffers. Jumping back to ddi_dma_attr for a moment. dma_attr_align is used during ddi_dma_mem_alloc(9F), don't expect it to do anything for you in the bind. dma_attr_count_max and dma_attr_seg limit the size of a cookie. If I have a 1M buffer which is physically contiguous, normally I would get a sgl length of 1 and the single cookie would be 1M in size. If I set seg or count_max to 256K-1, I would get a sgl length of 4 or 5 (depending on if the start address was page aligned) where each cookie would be <= 256K in size. Why do we have both seg and count_max? don't know...

OK, we finally arrive at windows... The fun stops here. Basically, a window is supposed to be a piece of a DMA bind that fits within the DMA constraints. i.e. if I have a bind for which the DMA engine cannot handle in a single transfer, and the driver/device supports partial mappings, the DDI is supposed to break it into multiple windows where each window can be handled by the DMA engine. Again, jumping back to ddi_dma_attr . There are three things which should require the use of multiple windows during a bind:

  • We need more copybuf space then the maximum copy buffer size allowed
  • The number of cookies required to bind the buffer is greater than the maximum number of cookies the H/W can handle (dma_attr_sgllen)
  • The size of the bind is greater than the maximum transfer size of the DMA engine (dma_attr_maxxfer)

But, from a historical note, that's not the way it was original implemented on the original x86 port. At the time this was written, the only time you will get multiple windows is when we need more copybuf space then the maximum copy buffer size allowed. This should be fixed shortly, but you will still have to handle how the current implementation works for the driver to operate correctly on s10 and before (I'll explain what that behavior is shortly). Don't worry though, once this is fixed, a driver which handles the old behavior will still work great with the correct behavior.

Once we need multiple windows, the rootnex now has to worry about the granularity of the device (dma_attr_granular). A device can only transfer data in even multiples of the granularity. e.g. if the granularity is set to 512, the size of a window must be an even multiple of 512. So when the rootnex gets to the end of a window, it sometimes has to subtract some data from the current window and put it into the next window to ensure the current window size is a multiple of granularity. This is referred to as trimming in the code. This gets pretty complicated with the way the rootnex DMA code is currently architected, and was the source of a fair number of bugs for which I had to put some not so obvious hacks in there to fix..

And last, but not least, what happens today in a bind when the driver supports a partial bind and one of the two conditions are hit:

  • The number of cookies required to bind the buffer is greater than the maximum number of cookies the H/W can handle (dma_attr_sgllen)
  • The size of the bind is greater than the maximum transfer size of the DMA engine (dma_attr_maxxfer)

Tune in next week..

Sorry couldn't resist.. Well, I don't think there's an official word for it, but I'm going to make up something as I type, because, remember, I like to do that sort of thing ;-). We get a superwindow, where a superwindow is a window larger than the DMA engine can handle. However, a superwindow is properly trimmed at the conditions mentioned above. So when the driver is going through the cookies, if the next cookie puts it over the DMA engines sgllen or maxxfer size, it can consider that cookie the start of the next window. So it puts more work back on the driver writer. Of course, if you've already written a Solaris driver for x86 which supports partial mappings, you have probably already figured that out :-/.

Well, that's enough for this month. I have code to finish up and putback. Don't forget to submit your DDI topic of interest in the comments section for next month...

MRJ

Technorati Tag:
Technorati Tag:

Tuesday Jun 22, 2004

What devices do you want supported & Writing Solaris Drivers

Looking for opinions out there... Mostly concentrating on x86, but SPARC comments welcome too.

  • For those of you who are, or would like to run Solaris x86, what's the top device that you would like to see supported?
  • For those of you that write, or would like to write Solaris device drivers, what could we do to make it easier for you?

Thanks!

MRJ

Tuesday Jun 08, 2004

Good place to get Solaris apps

Here's a really useful site...

http://www.blastwave.org/

nice to have pkg-get w/ dependencies on Solaris... :-)

Solaris really needs a standard app which is as easy to use as pkg-get and up2date...

MRJ

About

mrj

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today