Thursday Mar 31, 2011

KSSL presentation

A while back I gave a presentation about KSSL at a CZOSUG meeting. The presentation seems to be gone from the opensolaris.org infrastructure, so I have copied the file here.

Wednesday Jun 02, 2010

netcat and network stack debugging

One of the options of the netcat program (/usr/bin/nc) available in OpenSolaris (if you don't have it installed, simply run pkg install SUNWnetcat) is -D, the debugging option. Only recently I realized that not everyone out there knows how to capture the debugging data once this option is set, since this is pretty Solaris specific. What this option does is basically set a flag inside the ip kernel module specific to the given connection structure (conn_t). netcat does this by calling the setsockopt() system call with the SO_DEBUG option on the given socket. As a result, the conn_debug flag is set inside the conn_t structure associated with the socket. This flag is then consulted in various functions manipulating the structure.
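
For illustration, here is a minimal sketch (my own example, not the actual nc(1) code) of how a program can set SO_DEBUG on a socket; it is an ordinary boolean socket option:

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Minimal sketch (not the nc(1) source): turn on SO_DEBUG for a socket
 * so that the kernel starts recording per-connection debug data.
 */
int
main(void)
{
	int s, on = 1;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1) {
		perror("socket");
		return (1);
	}

	if (setsockopt(s, SOL_SOCKET, SO_DEBUG, &on, sizeof (on)) == -1) {
		perror("setsockopt(SO_DEBUG)");
		return (1);
	}

	(void) printf("SO_DEBUG set\n");
	return (0);
}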

When an interesting event occurs and conn_debug is set, the function calls the strlog() kernel function to record the data. For example, here's a snippet of usr/src/uts/common/inet/tcp/tcp_input.c:tcp_input_listener():

	if (listener->tcp_conn_req_cnt_q >= listener->tcp_conn_req_max) {
		mutex_exit(&listener->tcp_eager_lock);
		TCP_STAT(tcps, tcp_listendrop);
		TCPS_BUMP_MIB(tcps, tcpListenDrop);
		if (lconnp->conn_debug) {
			(void) strlog(TCP_MOD_ID, 0, 1, SL_TRACE|SL_ERROR,
			    "tcp_input_listener: listen backlog (max=%d) "
			    "overflow (%d pending) on %s",
			    listener->tcp_conn_req_max,
			    listener->tcp_conn_req_cnt_q,
			    tcp_display(listener, NULL, DISP_PORT_ONLY));
		}
		goto error2;
	}

To capture the data logged via strlog() it's necessary to know the STREAMS module ID, which in our case is TCP_MOD_ID, defined in usr/src/uts/common/inet/tcp_impl.h as 5105.

To read the data one can use either the strace(1M) command line tool or the strerr(1M) daemon, both of which produce text logs. To read everything, one can use this command (it needs read access to /dev/log so it has to run as root):

# strace 5105 all all

Here are two examples using netcat. First we try to bind to a port for which we don't have privileges:

$ nc -D -l -4 -p 23
nc: Permission denied

which produces the following entry from strace:

000004 12:11:04 19c7ff6a  1 ..E 5105 0 ip: [ID 234745 kern.debug] tcp_bind: no priv for port 23

Next we try to bind to an already occupied port:

$ nc -D -l -4 -p 4444
nc: Address already in use

which produces the following entry from strace:

000005 12:15:33 19c86878  1 ..E 5105 0 ip: [ID 326978 kern.debug] tcp_bind: requested addr busy

This is of course traceable via normal tools such as ppriv(1) or truss(1), but the point is that much finer grained detail can be captured from the network modules. The format of the log entries is explained in the strace(1M) man page.

Thursday Dec 10, 2009

ZFS likes to have ECC RAM

I have been using a custom-built {ZFS,OpenSolaris}-based NAS at home for more than a year. The machine was built partly from second hand components (e.g. the motherboard), partly from unused in-house iron, and only a minority of it from brand new stuff (more on that in a separate entry). The machine has been running constantly, serving data occasionally with very light load. One day I needed to perform some administrative task and realized it was not possible to SSH into the machine. Console login revealed that the uptime was just a couple of days and that both pools (root pool and data pool) contained a staggering number of checksum errors. In the /var/crash/ directory there were a couple of crash dumps. Some of them were corrupted and mdb(1) refused to load them or reported garbage. The times of the crashes corresponded to the Sunday night scrubbing of each of the pools. At least two of the dumps contained an interesting and fairly obvious stack trace. I no longer have the file so here's just the entry from the log:

Nov  1 02:27:20 chiba ^Mpanic[cpu0]/thread=ffffff0190914040: 
Nov  1 02:27:20 chiba genunix: [ID 683410 kern.notice] BAD TRAP: type=d (#gp General protection) rp=ffffff0006822380 addr=488b
Nov  1 02:27:20 chiba unix: [ID 100000 kern.notice] 
Nov  1 02:27:20 chiba unix: [ID 839527 kern.notice] sh: 
Nov  1 02:27:20 chiba unix: [ID 753105 kern.notice] #gp General protection
Nov  1 02:27:20 chiba unix: [ID 358286 kern.notice] addr=0x488b
Nov  1 02:27:20 chiba unix: [ID 243837 kern.notice] pid=740, pc=0xfffffffffba0373a, sp=0xffffff0006822470, eflags=0x10206
Nov  1 02:27:20 chiba unix: [ID 211416 kern.notice] cr0: 8005003b cr4: 6f8
Nov  1 02:27:20 chiba unix: [ID 624947 kern.notice] cr2: fee86fa8
Nov  1 02:27:20 chiba unix: [ID 625075 kern.notice] cr3: b96a0000
Nov  1 02:27:20 chiba unix: [ID 625715 kern.notice] cr8: c
Nov  1 02:27:20 chiba unix: [ID 100000 kern.notice] 
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     rdi: ffffff018b1e1c98 rsi: ffffff01a032dfb8 rdx: ffffff0190914040
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     rcx: ffffff018ef054b0  r8:                c  r9:                b
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     rax: ffffff01a032dfb8 rbx:                0 rbp: ffffff00068224a0
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     r10:                0 r11:                0 r12: ffbbff01a032d740
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     r13: ffffff01a032dfb8 r14: ffffff018b1e1c98 r15:             488b
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     fsb:                0 gsb: fffffffffbc30400  ds:               4b
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]      es:               4b  fs:                0  gs:              1c3
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]     trp:                d err:                0 rip: fffffffffba0373a
Nov  1 02:27:20 chiba unix: [ID 592667 kern.notice]      cs:               30 rfl:            10206 rsp: ffffff0006822470
Nov  1 02:27:20 chiba unix: [ID 266532 kern.notice]      ss:               38
Nov  1 02:27:20 chiba unix: [ID 100000 kern.notice] 
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822260 unix:die+10f ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822370 unix:trap+43e ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822380 unix:_cmntrap+e6 ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff00068224a0 genunix:kmem_slab_alloc_impl+3a ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff00068224f0 genunix:kmem_slab_alloc+a1 ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822550 genunix:kmem_cache_alloc+130 ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff00068225c0 zfs:dbuf_create+4e ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff00068225e0 zfs:dbuf_create_bonus+2a ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822630 zfs:dmu_bonus_hold+7e ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff00068226c0 zfs:zfs_zget+5a ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822780 zfs:zfs_dirent_lock+3fc ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822820 zfs:zfs_dirlook+d9 ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff00068228a0 zfs:zfs_lookup+25f ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822940 genunix:fop_lookup+ed ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822b80 genunix:lookuppnvp+3a3 ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822c20 genunix:lookuppnatcred+11b ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822c90 genunix:lookuppn+5c ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822e90 genunix:exec_common+1ac ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822ec0 genunix:exece+1f ()
Nov  1 02:27:20 chiba genunix: [ID 655072 kern.notice] ffffff0006822f10 unix:brand_sys_syscall32+19d ()
Nov  1 02:27:20 chiba unix: [ID 100000 kern.notice] 
Nov  1 02:27:20 chiba genunix: [ID 672855 kern.notice] syncing file systems...

Also, next to the messages on the console I found some entries in /var/adm/messages like this one:

Nov  2 12:15:01 chiba genunix: [ID 647144 kern.notice] ksh93: Cannot read /lib/amd64/ld.so.1

Later on, the condition of the machine worsened and it was not even possible to execute some commands due to I/O errors, up to the point where the machine had to be halted.

The panic occurring in kmem routines, the loads of checksum errors on both mirrored pools (the same number of errors for each disk in the mirror) and the fact that the system had been running the same build for a couple of months without a problem led me to try memtest.

The errors started appearing on the screen in the first couple of seconds of the run. It turned out one of the three 1GB DDR2 modules had gone bad. In case you're wondering, the DIMMs were bought new a year ago, were branded (all of them the same type, from a brand known for gaming/overclocking equipment) and had aluminium heat sinks on them, so no low quality stuff.

I was able to recover the data from past snapshots and replaced the RAM with ECC DIMMs (which required a new motherboard+CPU combo). This is a nice case of semi-silent data corruption detection. Without checksums the machine would have happily kept panicking and corrupting data without giving a clear indication of what was going on (e.g. which files were corrupted). So, even for a home NAS solution, ECC RAM is good (if not essential) to have.

FMA should do the right thing if one of the ECC modules goes bad, which means it will not allow the bad pages to be used (the pages will be retired). The list of retired pages is persistent across reboots. More on FMA and ECC RAM can be found e.g. in this discussion on fm-discuss, in the FMA and DIMM serial numbers entry on Rob Johnston's blog, or in the Eversholt rules for AMD in usr/src/cmd/fm/eversholt/files/i386/i86pc/amd64.esc.

Friday Oct 30, 2009

signal() versus sigaction() on Solaris

This entry is mostly for newcomers to Solaris/OpenSolaris from other UNIX-like systems. When I was taught about signal() and sigaction() my understanding was that sigaction() is just a superset of signal(), and also POSIX conformant, but that otherwise they accomplish the same thing. This is indeed the case on some UNIX-like operating systems. On Solaris, as I only recently discovered (to my dismay :)), it's different.

Consider the following code (please ignore the fact that it's not strictly checking return values and that the signal handler is not async-signal-safe):

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>

void sig_handler(int s) {
	printf("Got signal! Sleeping.\n");

	sleep(10);

	printf("returning from signal handler\n");
}

int main(void) {
	struct sigaction s_action;

	printf("Setting signal handler: ");
#ifdef POSIX_SIGNALS
	printf("sigaction\n");
	(void) sigemptyset(&s_action.sa_mask);
	s_action.sa_handler = sig_handler;
	s_action.sa_flags = 0;
	(void) sigaction(SIGHUP, &s_action, (struct sigaction *) NULL);
#else
	printf("signal\n");
	signal(SIGHUP, sig_handler);
#endif

	printf("Waiting for signal\n");
	while (1)
		pause();

	return (0);
}

Now try to compile and run with and without -DPOSIX_SIGNALS and send 2 SIGHUP signals to the process within the 10 second window (so the second signal is received while the signal handler is still running). With sigaction(), the signal will be caught by the handler in both cases. With signal(), however, the second signal will cause the process to exit. This is because the kernel resets the signal handler to the default disposition upon delivering the signal for the first time. This is described in the signal(3C) man page in a somewhat hidden sentence inside the second paragraph (it really pays off to read man pages slowly and with attention to detail):

     If signal()  is
     used,  disp  is  the address of a signal handler, and sig is
     not  SIGILL, SIGTRAP, or  SIGPWR, the system first sets  the
     signal's disposition to  SIG_DFL before executing the signal
     handler.

The sigaction(2) man page has this section:

     SA_RESETHAND    If set and the signal is caught, the  dispo-
                     sition of the signal is reset to SIG_DFL and
                     the signal will not be blocked on  entry  to
                     the  signal  handler  (SIGILL,  SIGTRAP, and
                     SIGPWR cannot be  automatically  reset  when
                     delivered; the system silently enforces this
                     restriction).

sigaction() does not set the flag by default which results in the different behavior. I found out that this behavior has been present since Solaris 2.0 or so.

In fact, signal() routine from libc is implemented via sigaction(). From $SRC/lib/libc/port/sys/signal.c:

/*
 * SVr3.x signal compatibility routines. They are now
 * implemented as library routines instead of system
 * calls.
 */

void(*
signal(int sig, void(*func)(int)))(int)
{
	struct sigaction nact;
	struct sigaction oact;

	CHECK_SIG(sig, SIG_ERR);

	nact.sa_handler = func;
	nact.sa_flags = SA_RESETHAND|SA_NODEFER;
	(void) sigemptyset(&nact.sa_mask);

	/*
	 * Pay special attention if sig is SIGCHLD and
	 * the disposition is SIG_IGN, per sysV signal man page.
	 */
	if (sig == SIGCHLD) {
		nact.sa_flags |= SA_NOCLDSTOP;
		if (func == SIG_IGN)
			nact.sa_flags |= SA_NOCLDWAIT;
	}

	if (STOPDEFAULT(sig))
		nact.sa_flags |= SA_RESTART;

	if (sigaction(sig, &nact, &oact) < 0)
		return (SIG_ERR);

	return (oact.sa_handler);
}

I am pretty sure that the SA_RESETHAND flag is set in signal() in order to preserve backwards compatibility.

This means that to work around this problem with signal(), one should set the signal handler again inside the signal handler itself (a minimal sketch of this workaround follows the table below). However, this is not a complete solution, since there is still a window during which the signal can be delivered while the handler is set to SIG_DFL - the default disposition, which for SIGHUP is Exit, as the signal.h(3HEAD) man page explains in a really useful table:

          Name        Value   Default                    Event
     SIGHUP           1       Exit       Hangup (see termio(7I))
     ...
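
To make the workaround concrete, here is a minimal sketch (my own illustration, not code from any man page): the handler re-installs itself as its very first action, but the race window between delivery and re-installation remains.

#include <signal.h>
#include <unistd.h>

/*
 * Minimal sketch of the classic signal() workaround: re-install the
 * handler as the first thing inside the handler itself. The window
 * between delivery and re-installation still exists.
 */
static void
sig_handler(int sig)
{
	(void) signal(sig, sig_handler);	/* re-arm as soon as possible */
	/* ... handle the signal ... */
}

int
main(void)
{
	(void) signal(SIGHUP, sig_handler);

	for (;;)
		pause();
	/* NOTREACHED */
	return (0);
}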

Now let's look at FreeBSD. Its SIGNAL(3) man page contains this separate paragraph:

     The handled signal is unblocked when the function returns and the process
     continues from where it left off when the signal occurred.  Unlike previ-
     ous signal facilities, the handler func() remains installed after a sig-
     nal has been delivered.

The second sentence is actually printed in bold letters. I also tried Linux and NetBSD and the behavior is the same as on FreeBSD.

So, to conclude all of the above: using signal() is really not portable.

Monday Aug 17, 2009

OpenSSL PKCS#11 engine presentation

Some light intro first: OpenSSL has a concept of plugins/add-ons called 'engines' which can supply alternative implementations of crypto operations (digests, symmetric and asymmetric ciphers and random data generation). The main reason for the existence of engines is the ability to offload crypto ops to hardware. (Open)Solaris ships with an engine called the PKCS#11 engine which provides access to the Solaris Cryptographic Framework, which in turn can provide access to hardware crypto.

I spent some time fixing bugs in the OpenSSL PKCS#11 engine in Solaris, so I got quite intimate with its internals. Recently, while discussing an upcoming feature with Jan, he asked me why one particular detail in the engine is done one way and not the other (it's the fork() detection not being done via atfork handlers, for the curious). It took me some thinking to find the answer (I was focused on other changes at that time), which made us realize that it would be good to summarize the design choices behind the engine and also to document the internals, so that others can quickly see what's going on inside and be able to make changes in the engine without reverse engineering the thoughts behind it. The outcome is a set of slides which I hope succinctly describe both the overall picture and the gritty details.

The presentation can be downloaded here.

Wednesday Jun 03, 2009

KSSL project is now open

I have spent some time fixing bugs in the KSSL (kernel SSL proxy) implementation in Solaris and got familiar with it (and the KSSL development team), so with delight I can co-announce that the KSSL project has been opened on opensolaris.org.

To me, KSSL is one of the unique projects in the (Open)Solaris security land in the sense that it is tightly integrated into the system and is a consumer of several major subsystems (networking, crypto framework), which makes it interesting for study and also for extending in creative ways.

We will start adding more content to the pages, including design documentation and a description of KSSL internals. Also, this marks a major milestone in the way the KSSL team does its job. From now on all non-confidential discussions, reviews etc. will happen in the open. Feel free to join the project and participate if you're interested! (Become an observer and join the mailing list.)

Monday May 04, 2009

External contributions to testing community

I have just integrated a couple of changes which I believe are the first contributed externally to the Testing community as an open-source contribution. The changes add a couple of new tests to the nc test suite to cover the enhancement described in PSARC/2008/680 (which has been present in Nevada since build 106). This is the stuff which allows you to run nc(1) in client mode with complex portlist specifications. Previously it was possible to use only simple port ranges like 22-80; with this change one can connect to e.g. 22,24,50-80,66,1024-2048. A little example of how it might be useful:

$ nc -v -z grok.czech 22,25,80-88,8080
Connection to grok.czech 22 port [tcp/ssh] succeeded!
nc: connect to 129.157.71.49 port 25 [host grok.czech] (tcp) failed: Connection refused
Connection to grok.czech 80 port [tcp/*] succeeded!
nc: connect to 129.157.71.49 port 81 [host grok.czech] (tcp) failed: Connection refused
nc: connect to 129.157.71.49 port 82 [host grok.czech] (tcp) failed: Connection refused
nc: connect to 129.157.71.49 port 83 [host grok.czech] (tcp) failed: Connection refused
nc: connect to 129.157.71.49 port 84 [host grok.czech] (tcp) failed: Connection refused
nc: connect to 129.157.71.49 port 85 [host grok.czech] (tcp) failed: Connection refused
nc: connect to 129.157.71.49 port 86 [host grok.czech] (tcp) failed: Connection refused
nc: connect to 129.157.71.49 port 87 [host grok.czech] (tcp) failed: Connection refused
nc: connect to 129.157.71.49 port 88 [host grok.czech] (tcp) failed: Connection refused
Connection to grok.czech 8080 port [tcp/*] succeeded!

Back to the testing part. The putback (yes, stcnv-gate is still using Teamware) log for this change looks like this (I have modified Erik's e-mail address a bit):

6786859 portranges_complex_spec is missing the listener
6754842 extended port list specification needs to be tested
Code contributed by Erik Trauschke <erik.trauschke AT freenet.de>

I think this is a really nice example of the ideal state - the contributor not only did the feature part but also the testing part. It shows a great degree of responsibility - not just throwing some code "over the fence" but fully participating in the process to ensure quality even in the long term.

The tests are both positive and negative. Each test purpose in the portranges directory is numbered and the following numbers match the test purpose numbers:

  • 5-12 ensure nc treats ports 0 and 65536 as invalid
    Previously it was possible to listen on ports 0 and 65536; with Erik's changes this is no longer true, so we need regression tests for both cases (client/server) and both ports.
  • 13-19 see if various malformed port list specifications are considered invalid
    Each feature needs not only positive tests which make sure the functionality actually works but also negative tests which ensure it does not misbehave. In this case, invalid port list specifications are thrown at nc(1) to see that it reacts accordingly (with an error, that is).
  • 20-25 test the functionality of complex port lists
    This is the bunch of tests which see whether the functionality actually works.
  • 26 tests reallocation
    Since the internal representation of the port list is now dynamically allocated, and there is a default pre-allocated value which is reallocated if needed, we need to test the case of reallocation.

To be able to do such an integration there is now a Test development process. It's similar to the process used in the ON community but more lightweight. The main difference is that the request-sponsor part is done informally via the testing-discuss mailing list and there is no list of bugs to pick from. But don't be shy; whether you're adding new functionality or a completely new program, the Testing community is here to help you.

Tuesday Mar 31, 2009

Command line history and editing for IPsec utilities

Since the days when John Beck added command line editing to zonecfg, Mark Phalan has done a similar thing for the Kerberos utilities and Huie-Ying Lee for sftp. The IPsec utilities (ipseckey(1M) and ikeadm(1M)) have offered the ability to enter commands in interactive mode for a long time, but only since Nevada build 112 do the commands support command line editing and history too. Again, thanks to libtecla (shipped with Solaris/OpenSolaris).

Lessons learned:

  • adding full-blown command line editing support is hard.
    Adding the initial support is quite easy. However, more advanced features can require substantial work. This is especially true for tab completion. For sftp, Huie-Ying decided to add tab completion in a future phase because of the ambiguities when completing names of files (when to complete local files versus remote files).
    I did the same with tab completion for the IPsec utilities - the integration only delivers basic command line editing support, without tab completion. The problem with ipseckey(1M) and ikeadm(1M) is that their grammar is quite bifurcated and has contexts. For example, you cannot use encr_alg with AH SAs in ipseckey. Or, it would be erroneous to tab complete a valid command in the middle of entering a key if the key's hex sequence contained a sub-string of a valid command. The hardest part is, I think, offering the right tables of valid commands in a given context. E.g. in our case a command line must start with a top-level command. Each top-level command offers several valid sub-commands, and we do not want to offer invalid sub-commands for a given top-level command, so there is a need to track the state of the finite state machine describing the grammar contexts.
    Also, after the user has entered src we do not want to allow them to enter it again on the same command line. And if the user has already entered, say, add esp spi, we are expecting an SPI number, not a command name.
    Ideally, to solve this problem in a nice way there should be a meta library (or an additional API in libtecla) which would offer the ability to link command tables and set the contexts.
  • interruptible cycles in command line mode
    ipseckey's monitor command reads from a PF_KEY socket in a loop. The loop is normally interruptible by SIGINT. To achieve that in a libtecla environment (we do not want to exit the command line upon SIGINT and yet we still need to interrupt the cycle), something like this is needed:
         static void
         monitor_catch(int signal)
         {
                 if (!interactive)
                         errx(signal, gettext("Bailing on signal %d."), signal);
         }
    
         void
         doreadcycle(void)
         {
         ...
    
                 /* Catch ^C. */
                 newsig.sa_handler = monitor_catch;
                 newsig.sa_flags = 0;
                 (void) sigemptyset(&newsig.sa_mask);
                 (void) sigaddset(&newsig.sa_mask, SIGINT);
                 (void) sigaction(SIGINT, &newsig, &oldsig);
    
                 for (; ; ) {
                         rc = read(keysock, samsg, sizeof (get_buffer));
                         /* handle the data */
                 }
    
                 /* restore old behavior */
                 if (interactive)
                         (void) sigaction(SIGINT, &oldsig, NULL);
         }
    
    
  • interaction with SMF
    While it's fine to bail out with an error in interactive mode, due to the nature of the IPsec commands (they can read the config files using the same routines as in interactive mode and they are used as SMF services to bring up IPsec policy and keys after boot) we need to distinguish between interactive and non-interactive mode.
  • maximum command line history value
    It seems that the second parameter to new_GetLine() - histlen - is commonly misunderstood. This variable does not express the maximum number of lines in the history but the maximum size of the history buffer in bytes. If the buffer becomes full, libtecla does not trim the last line but shifts the buffer instead.
    Given that the first parameter to new_GetLine() expresses the maximum command line size (in bytes), one needs to do some calculations and estimates of what will be needed, to avoid too big a buffer - ipseckey is used to enter key material, so a line can become quite long. Say we wanted to keep 1024 lines. If the maximum length of a line is 1024 bytes, this would give us a 1 megabyte buffer, which seems too much for a simple application. Thus I did some guessing and set the buffer size accordingly (a minimal sketch of the resulting call follows this list):
    For "common" ipseckey configuration commands (think a moderately bifurcated 'add') it's circa 300 characters. Mostly, however, the users enter query commands like 'flush esp', 'dump ah' and the like, so that is somewhere around 30 characters. Say 30% of the commands are configuration and the rest are queries. To hold 100 such commands only circa 10K of memory is required. In the end I chose 64K, to be able to hold 15 of the big (4K) commands.
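
As a minimal illustration (my own sketch, not the actual ipseckey code; the prompt string is made up), this is how the two size parameters of new_GetLine() come into play. It links against libtecla (-ltecla):

#include <stdio.h>
#include <libtecla.h>

/*
 * Minimal sketch: the first argument of new_GetLine() is the maximum
 * line length in bytes, the second one is the size of the history
 * buffer in bytes - not a number of lines.
 */
int
main(void)
{
	GetLine *gl;
	char *line;

	/* 4K lines, 64K history buffer - roughly 15 of the biggest commands. */
	if ((gl = new_GetLine(4096, 65536)) == NULL)
		return (1);

	while ((line = gl_get_line(gl, "ipseckey> ", NULL, -1)) != NULL)
		(void) fputs(line, stdout);

	gl = del_GetLine(gl);
	return (0);
}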

Thursday Nov 20, 2008

Testing netcat

After multiple rounds of code review the netcat (or nc) test suite is now finally in the onnv-stc2 gate. The test suite has its home in the OpenSolaris Networking community (see the networking tests page for the list of networking test suites).
The source code is present in the src/suites/net/nc/ directory and SUNWstc-netcat packages can be downloaded from the OpenSolaris Download center.

Before I go further, this is how it looks when the test suite is run (the output is trimmed a bit):

vk:honeymooners:/opt/SUNWstc-nc$ run_test nc
Validating Arguments...
New TET_ROOT for this run : /var/tmp/honeymooners_27828
The results will be available in /var/tmp/results.27828
tcc: journal file is /var/tmp/results.27828/testlog
12:45:57  Execute /tests/dflag/tc_dflag
12:46:04  Execute /tests/hflag/tc_hflag
12:46:05  Execute /tests/kflag/tc_kflag
12:46:11  Execute /tests/nflag/tc_nflag
12:46:15  Execute /tests/portranges/tc_portranges
12:46:23  Execute /tests/pflag/tc_pflag
12:46:26  Execute /tests/sflag/tc_sflag
12:46:35  Execute /tests/Uflag/tc_Uflag
12:46:36  Execute /tests/vflag/tc_vflag
12:46:43  Execute /tests/zflag/tc_zflag
12:46:46  Execute /tests/iflag/tc_iflag
12:46:59  Execute /tests/lflag/tc_lflag
12:47:29  Execute /tests/rflag/tc_rflag
12:48:16  Execute /tests/Tflag/tc_Tflag
12:48:33  Execute /tests/uflag/tc_uflag
12:48:50  Execute /tests/wflag/tc_wflag
##################################################
TC /tests/dflag/tc_dflag

TP 1 tc_dflag PASS
##################################################
TC /tests/hflag/tc_hflag

TP 1 tc_hflag PASS

...

##################################################
                 SUMMARY      
                 =======      
 
Number of Tests : 50

PASS            : 50
FAIL            : 0
UNRESOLVED      : 0
UNINITIATED     : 0
OTHER           : 0
 
##################################################

Test Logs are at /var/tmp/results.27828, Journal File = /var/tmp/results.27828/testlog 

vk:honeymooners:/opt/SUNWstc-nc$

It's been almost a year since I started developing the test suite last Christmas (see the initial blog entry about nc-tet). Since then, I have lost part of the source code in a hard drive crash, had to redo the source tree structure, fix ksh style, fix numerous bugs in the test suite code and make the test suite more robust. One might ask whether having a test suite for such a simple program as nc(1) was worth the hassle. I have only one answer to that: absolutely. First, it gives the confidence of not breaking (most of; see below) existing things when changing/adding functionality, and second, it helped me (and I hope also the others participating in or observing the code review on testing-discuss) to explore what it takes to write a test suite from scratch (I will not go into details here about whether I prefer CTI-TET over STF or vice versa).

The Beautiful Code book (which I really recommend to anyone tinkering with any source code) contains a chapter called Beautiful Tests by Alberto Savoia. I hope that at least some of the test purposes in the nc test suite have some degree of beauty in at least one of the ways highlighted by Alberto (1. simplicity/efficiency, 2. helping to make the software being tested better in terms of quality and testability, 3. breadth/thoroughness).

One of the important questions for a test suite is its level of code coverage. Obviously, for software adhering to the OpenSolaris interface taxonomy model it is important that the test suite exercises all of the Committed interfaces and the execution paths around those interfaces. For nc(1) this means a subset of the command line options and their arguments (see PSARC 2007/389 for the actual list). The key is certainly to test the features which are likely to break with an intrusive code change.

A very crude view of the test coverage of the nc(1) test suite (counting test purposes gives only a very remote idea about real coverage but at least provides a visual image) looks like this:

       rflag: +
       Tflag: +++++---
       pflag: +
       iflag: +-
       vflag: ++
       kflag: +
       Uflag: +-
       dflag: +
       uflag: ++-
       sflag: +-
       hflag: +
       nflag: +-
       wflag: +
  portranges: +---
       lflag: ++++++++----------

One plus character stands for one positive test purpose, a minus for one negative test purpose.

Side note: the above ASCII graph was produced using the test-coverage-graph.sh script (which presumes a certain naming scheme for test purpose files). Just pipe a file listing with test purpose filenames compliant to the scheme used in the ontest-stc2 gate into the script and it will spew out a graph similar to the above.

In the above musing about code coverage I left out an important piece - why some of the features are not tested. For nc(1) the yet untested part is the SOCKS protocol support. Basically, this is because the test suite environment does not contain a SOCKS server to test against. There might not be many people using the -x/-X options, but from my own experience nothing is more frustrating than discovering some old dusty corner which should have been fixed long ago or removed completely. So for now, on my workstation which sits behind a SOCKS proxy, I have the following in ~/.ssh/config for a server outside the corporate network which hosts my personal mailbox, so it is accessed every day:

Host bar
  User foo
  Hostname outside.kewl.org
  # NOTE: for nc(1) testing
  ProxyCommand /usr/bin/nc -x socks-proxy.foothere.bar outside.kewl.org %p
  ForwardAgent no
  ForwardX11 no

This ensures (along with periodic upgrades of the workstation to recent Nevada builds) that the SOCKS support gets tested as well. And yes, ssh-socks5-proxy-connect(1) and ssh-http-proxy-connect(1) are not really needed.

Now with the test suite in place, anybody modifying nc(1) (there are some RFEs for nc in the oss-bit-size list, and other bugfixes or features are also welcome) can have pretty high confidence that their change will not break things. Yes, this means that more nc(1) features are coming.

Friday Nov 07, 2008

Automatic webrev upload

I will start this one a little bit generically...

Part of the standard OpenSolaris/Solaris development process is code review. To facilitate a review, a so-called webrev is needed. A webrev is a set of HTML/text/PDF pages and documents which display all the changes between a local repository containing the changes and its parent repository. To produce a webrev, simply switch to the repository and run the webrev script (it is part of the SUNWonbld package, which can be downloaded from the OpenSolaris download center):

$ cd /local/Source/bugfix-314159.onnv
$ webrev

Assuming /opt/onbld/bin is present in your PATH, a webrev will be generated under the /local/Source/bugfix-314159.onnv/webrev/ directory.

For OpenSolaris changes, the webrev is usually uploaded to cr.opensolaris.org (every OpenSolaris member automatically has an account created for them), which serves it at http://cr.opensolaris.org/~OSol_username/ (where OSol_username is your OpenSolaris username), and a request for review with a link to the webrev is sent to one of the mailing lists relevant to the change.
Dan Price has written a script which produces an RSS feed of recently uploaded webrevs, which is a pretty handy substitute for feeds from news/headlines/magazines :)

For a long time I was basically doing the following:

$ cd /local/Source/bugfix-foo.onnv && webrev
$ scp -r webrev cr.opensolaris.org:bugfix-foo.onnv

This had two flaws: first, it was slow (because of the rcp protocol over the SSH channel) and second, I had to delete the old webrev via a separate command (use sftp(1) and rename the old webrev to a .trash directory) before uploading a new version of the webrev (otherwise a couple of permissions errors would follow).

To solve the first problem, rsync (with SSH transport) can be used, which makes the upload nearly instantaneous. The second problem can be worked around by using incremental webrevs. Still, this did not seem good enough for code reviews with many iterations.

So, the change made in CR 6752000 introduces the following command line options for automatic webrev upload:

  • -U uploads the webrev
  • -n suppresses webrev generation
  • -t allows specifying a custom upload target

The webrev.1 man page has been updated to explain the usage. For common OpenSolaris code reviews the usage will probably mostly look like this:

$ webrev -O -U

This will upload the webrev to cr.opensolaris.org under a directory named after the local repository. Further invocations will replace the remote webrev with a fresh version.
But it is possible to get more advanced. After the initial webrev is posted, an incremental webrev can be both generated and posted. Assuming you're switched to the repository (via bldenv) and we're dealing with the 4th round of code review, the following commands will perform the task:

webrev_name=`basename $CODEMGR_WS`
webrev -O -U -o $CODEMGR_WS/${webrev_name}.rd4 \
    -p $CODEMGR_WS/${webrev_name}.rd3

The above commands hide maybe not-so-obvious behavior, so I'll try to explain it in a table:

+---------------------------+------------------------+-----------------------------------------------------+
| command                   | local webrev directory | remote webrev directory                             |
+---------------------------+------------------------+-----------------------------------------------------+
| webrev -O -U              | $CODEMGR_WS/webrev/    | cr.opensolaris.org/~OSOLuser/`basename $CODEMGR_WS` |
+---------------------------+------------------------+-----------------------------------------------------+
| webrev -O -o \            | $CODEMGR_WS/my_webrev/ | cr.opensolaris.org/~OSOLuser/my_webrev              |
|   $CODEMGR_WS/my_webrev   |                        |                                                     |
+---------------------------+------------------------+-----------------------------------------------------+
| webrev -O \               | $CODEMGR_WS/fix.rd2/   | cr.opensolaris.org/~OSOLuser/fix.rd2                |
|  -p $CODEMGR_WS/fix.rd1 \ |                        |                                                     |
|  -o $CODEMGR_WS/fix.rd2   |                        |                                                     |
+---------------------------+------------------------+-----------------------------------------------------+

Basically, without the -o flag webrev will generate the webrev into a local directory named 'webrev' but will upload it to a directory named after the basename of the local repository. With the -o flag, webrev will use the specified name for both local and remote storage. This is done to keep the default behavior of generating the local webrev into a directory named 'webrev'. At the same time, uploading different webrevs to the same remote directory named 'webrev' would not make sense.

NOTE: This behavior also applies when webrev is not run from within a repository via ws or bldenv; I have just used $CODEMGR_WS to express the root directory of a workspace/repository.

Also, it is now possible to call webrev from within the Cadmium Mercurial plugin, so all webrev commands can be prefixed with hg.

All in all, it was fun dealing with webrevs of webrev. I am looking forward to more entries in the RSS feed :)

NOTE: It will take some time before the changes appear in the SUNWonbld packages offered by the download center, so it's better to update the sources from the ssh://anon@hg.opensolaris.org/hg/onnv/onnv-gate repository and build and upgrade the SUNWonbld package from there.

Wednesday Sep 03, 2008

strsep() in libc

As of today, the strsep() function lives in Nevada's libc (tracked by CR 4383867 and PSARC 2008/305). This constitutes another step in the quest for a more feature-full (in terms of compatibility) libc in OpenSolaris. In binary form, the changes will be available in build 99. The documentation will be part of the string(3C) man page.

Here's a small example of how to use it:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <err.h>

int parse(const char *str) {
        char *p = NULL;
        char *inputstring, *origstr;
        int ret = 1;

        if (str == NULL)
                errx(1, "NULL string");

        /*
         * We have to remember the original pointer because strsep()
         * will change the 'inputstring' pointer.
         */
        if ((origstr = inputstring = strdup(str)) == NULL)
                errx(1, "strdup() failed");

        printf("=== parsing '%s'\n", inputstring);
        for ((p = strsep(&inputstring, ",")); p != NULL;
            (p = strsep(&inputstring, ","))) {
                if (p != NULL && *p != '\0')
                        printf("%s\n", p);
                else if (p != NULL) {
                        warnx("syntax error");
                        ret = 0;
                        goto bad;
                }
        }
bad:
        printf("=== finished parsing\n");
        free(origstr);
        return (ret);
}

int main(int argc, char *argv[]) {
        if (argc != 2)
                errx(1, "usage: prog <string>");

        if (!parse(argv[1]))
                exit(1);

        return (0);
}

This example was actually used as a unit test (use e.g. "1,22,33,44" and "1,22,,44,33" as input strings) and it also nicely illustrates important properties of strsep()'s behavior:

  • While searching for tokens, strsep() modifies the original string. This property is shared with strtok().
  • Unlike strtok(), strsep() is able to detect empty fields.

There is a function in Solaris' libc which can do token scanning and does not modify the original string - strcspn(). The other notable property of strsep() is that (unlike strtok()) it does not conform to ANSI C. Time to draw a table (a small strcspn() example follows it):

 function(s)   ISO C90    modifies     detects
                           input     empty fields
-------------+----------+----------+--------------+
 strsep()        No          Yes         Yes
 strtok()        Yes         Yes         No
 strcspn()       Yes         No        Sort of
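
To illustrate the strcspn() row of the table, here is a minimal sketch of my own (not from libc or from the original unit test) of scanning tokens without modifying the input; empty fields show up as zero-length spans, hence the "Sort of":

#include <stdio.h>
#include <string.h>

/*
 * Minimal sketch: walk a string with strcspn() without modifying it.
 * A zero-length span means an empty field.
 */
int
main(void)
{
	const char *str = "1,22,,44,33";
	const char *p = str;

	for (;;) {
		size_t span = strcspn(p, ",");

		(void) printf("token: '%.*s'\n", (int)span, p);
		if (p[span] == '\0')
			break;
		p += span + 1;	/* skip the delimiter */
	}
	return (0);
}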

None of the above functions is bullet-proof. The bottom line is that the user should decide which one is the most suitable for a given task and use it with its properties in mind.

Sunday Apr 27, 2008

Test suite for netcat

In the OpenSolaris world we care very much about correctness and hate regressions (of any kind). If I loosely paraphrase Bryan Cantrill, the degree of devotion should be obvious:

"Have you tested your change in every way you know of ? If not, do not go any further with the integration unless you do so."

This implies that an ordinary bug fix should have a unit test accompanying it. But unit tests are cumbersome when performed by hand and do not mean much if they are not accumulated over time.

For the integration of Netcat into OpenSolaris I developed a number of unit tests (basically at least one for each command line option) and a couple more after spotting some bugs in nc(1). This means that nc(1) is ripe for having a test suite so the tests can be performed automatically. This is tracked by RFE 6646967. The test suite will live in the onnv-stc2 gate which is hosted and maintained by the OpenSolaris Testing community.

To create a test suite one can choose between two frameworks: STF and CTI-TET. I have chosen the latter because I wanted to try something new and also because CTI-TET seems to be the recommended framework these days.

The work on the nc test suite started during the Christmas break of 2007, and after recovering from lost data it is now in a pretty stable state and ready for code review. This is actually somewhat exciting because the nc test suite is supposed to be the first OpenSolaris test suite developed in the open.

A fresh webrev is always stored on cr.opensolaris.org in the nc-tet.onnv-stc2 directory. Everybody is invited to participate in the code review.

Code review should be performed via the testing-discuss at opensolaris.org mailing list (subscribe via Testing / Discussions). It has a web interface in the form of the testing-discuss forum.

So, if you're familiar with ksh scripting or the CTI-TET framework (neither is strictly necessary) you have a unique chance to bash (not bash) my code! Watch for the official code review announcement on the mailing list in the next couple of days.

Lastly, another piece of philosophical food for thought: test suites are sets of programs and scripts which serve mainly one purpose - they should prevent bugs from happening in the software they test. But test suites are software too. The presence of bugs in test suites is an annoying phenomenon. How do we get rid of that one?

Sunday Apr 13, 2008

poll(2) and POLLHUP with pipes in Solaris

During nc(1) preintegration testing, a short time before it went back, I found that 'cat /etc/passwd | nc localhost 4444' produced an endless loop with 100% CPU utilization, looping in calls to poll(2) (I still remember my laptop suddenly getting much warmer than it should be and the CPU fan cranking up). 'nc localhost 4444 < /etc/passwd' did not exhibit that behavior.
The cause was a difference between the poll(2) implementations on BSD and Solaris. Since I am working on Netcat in Solaris again (adding more features, stay tuned), it's time to take a look back and maybe even help people porting similar software from BSD to Solaris.

The issue appears because on Solaris POLLHUP is set in the read events bitfield for stdin after the pipe is closed (or to be more precise - after the producer/write end is done). poll.c (which resembles the readwrite() function from nc) illustrates the issue:

01 #include <stdio.h>
02 #include <poll.h>
03 
04 #define LEN  1024
05 
06 int main(void) {
07      int timeout = -1;
08      int n;
09      char buf[LEN];
10      int plen = LEN;
11 
12      struct pollfd pfd;
13 
14      pfd.fd = fileno(stdin);
15      pfd.events = POLLIN;
16 
17      while (pfd.fd != -1) {
18              if ((n = poll(&pfd, 1, timeout)) < 0) {
19                      err(1, "Polling Error");
20              }
21              fprintf(stderr, "revents = 0x%x [ %s %s ]\n",
22                  pfd.revents,
23                  pfd.revents & POLLIN ? "POLLIN" : "",
24                  pfd.revents & POLLHUP ? "POLLHUP" : "");
25      
26              if (pfd.revents & (POLLIN|POLLHUP)) {
27                      if ((n = read(fileno(stdin), buf, plen)) < 0) {
28                              fprintf(stderr,
29                                  "read() returned neg. val (%d)\n", n);
30                              return (1);
31                      } else if (n == 0) {
32                              fprintf(stderr, "read() returned 0\n");
33                              pfd.fd = -1;
34                              pfd.events = 0;
35                      } else {
36                              fprintf(stderr, "read: %d bytes\n", n);
37                      }
38              }
39      }
40 }

Running it on NetBSD (chosen because my personal non-work mailbox is hosted on a machine running it) produces the following:

otaku[~]% ( od -N 512 -X -v /dev/zero | sed 's/ [ \t]*/ /g'; sleep 3 ) | ./poll
revents = 0x1 [ POLLIN  ]
read: 1024 bytes
revents = 0x1 [ POLLIN  ]
read: 392 bytes
revents = 0x11 [ POLLIN POLLHUP ]
read() returned 0

I had to post-process the output of od(1) (because of differences between the output of od(1) on NetBSD and Solaris) and slow the execution down a bit (via sleep) in order to make things more visible (try to run the command without the sleep and the pipe will be closed too quickly). On OpenSolaris the same program produces a different pattern:

moose:~$ ( od -N 512 -X -v /dev/zero | sed 's/ [ \t]*/ /g' ; sleep 3 ) | ./poll 
revents = 0x1 [ POLLIN  ]
read: 1024 bytes
revents = 0x1 [ POLLIN  ]
read: 392 bytes
revents = 0x10 [  POLLHUP ]
read() returned 0

So, the program is now obviously correct. Had the statement on line 26 checked only POLLIN, the command above (with or without the sleep) would go into an endless loop on Solaris:

revents = 0x11 [ POLLIN POLLHUP ]
read: 1024 bytes
revents = 0x11 [ POLLIN POLLHUP ]
read: 392 bytes
revents = 0x10 [  POLLHUP ]
revents = 0x10 [  POLLHUP ]
revents = 0x10 [  POLLHUP ]
...

Both OSes set POLLHUP after the pipe is closed. The difference is that while BSD always indicates POLLIN (even if there is nothing to read), Solaris strips it after the data stream has ended. So, which one is correct? The poll() function as described by the Open Group says that "POLLHUP and POLLIN are not mutually exclusive". This means both implementations seem to conform to the IEEE Std 1003.1, 2004 Edition standard (part of POSIX) in this respect.

However, the POSIX standard also says:

    In each pollfd structure, poll ( ) shall clear the revents member, except that where the application requested a report on a condition by setting one of the bits of events listed above, poll ( ) shall set the corresponding bit in revents if the requested condition is true. In addition, poll ( ) shall set the POLLHUP, POLLERR, and POLLNVAL flag in revents if the condition is true, even if the application did not set the corresponding bit in events.

This might still be ok, even though the POLLIN flag remains set in NetBSD's poll() after no more data are available for reading (try commenting out lines 33 and 34 and running as above), because the standard says about the POLLIN flag: For STREAMS, this flag is set in revents even if the message is of zero length.

Without further reading it is hard to tell how exactly a POSIX compliant poll() should behave. On the Austin group mailing list there was a thread about poll() behavior w.r.t. POLLHUP suggesting this is a fuzzy area.

Anyway, to see where exactly POLLHUP is set for pipes in OpenSolaris, go to fifo_poll(). The function _sets_ the revents bit field to POLLHUP, so the POLLIN flag is wiped off in the process. fifo_poll() is part of the fifofs kernel module which has been around in Solaris since the late eighties (I was still in elementary school the year fifovnops.c appeared in the SunOS code base :)). NetBSD has fifofs too, but there the POLLHUP flag gets set via a bit logic operation in pipe_poll(), which is part of the syscall processing code. The difference between the OpenSolaris and NetBSD attitudes to POLLHUP is now clear.
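
The following toy program (an illustration of the bit manipulation only, not the actual kernel code of either system) shows why assigning POLLHUP wipes POLLIN while OR-ing it in keeps it:

#include <stdio.h>
#include <poll.h>

/*
 * Illustration only: the difference between assigning POLLHUP to revents
 * (POLLIN is lost, as in Solaris fifo_poll()) and OR-ing it in
 * (POLLIN survives, as in NetBSD pipe_poll()).
 */
int
main(void)
{
	short assigned = POLLIN;
	short ored = POLLIN;

	assigned = POLLHUP;	/* POLLIN is gone */
	ored |= POLLHUP;	/* POLLIN survives */

	(void) printf("assigned = 0x%hx, ored = 0x%hx\n", assigned, ored);
	return (0);
}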

Thursday Apr 03, 2008

ZFS is going to save my laptop data next time

The flashback is still alive even weeks later: the day before my presentation at the FIRST Technical Colloquium in Prague I brought my 2 year old laptop with the work-in-progress slides to the office. Since I wanted to finish the slides in the evening, a live-upgrade process was fired off on the laptop to get a fresh Nevada version (of course, to show off during the presentation ;)).
LU is a very I/O intensive process and the red Ferrari notebooks tend to get _very_ hot. In the afternoon I noticed that the process had failed. To my astonishment, I/O operations started to fail. After a couple of reboots (and zpool status / fmadm faulty commands) it was obvious that the disk could not be trusted anymore. I was able to rescue some data from the ZFS pool which was spanning the biggest slice of the internal disk, but not all of it. (ZFS is not willing to hand out corrupted data.) My slides were lost, as well as other data.

After some time I stumbled upon James Gosling's blog entry about ZFS mirroring on a laptop. This got me started (or more precisely, I was astonished and wondered how this idea could have escaped me, because at that time ZFS had been in Nevada for a long time) and I discovered several similar and more in-depth blog entries about the topic.
After some experiments with a borrowed USB disk it was time to make it reality on a new laptop.

The process was a multi-step one:

  1. First I had to extend the free slice #7 on the internal disk so that it spans the remaining space on the disk, because it had been trimmed after the experiments. In the end the slices look like this in format(1) output:
    Part      Tag    Flag     Cylinders         Size            Blocks
      0       root    wm       3 -  1277        9.77GB    (1275/0/0)   20482875
      1 unassigned    wm    1278 -  2552        9.77GB    (1275/0/0)   20482875
      2     backup    wm       0 - 19442      148.94GB    (19443/0/0) 312351795
      3       swap    wu    2553 -  3124        4.38GB    (572/0/0)     9189180
      4 unassigned    wu       0                0         (0/0/0)             0
      5 unassigned    wu       0                0         (0/0/0)             0
      6 unassigned    wu       0                0         (0/0/0)             0
      7       home    wm    3125 - 19442      125.00GB    (16318/0/0) 262148670
      8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
      9 alternates    wu       1 -     2       15.69MB    (2/0/0)         32130
    
  2. Then the USB drive was connected to the system and recognized via format(1):
    AVAILABLE DISK SELECTIONS:
           0. c0d0 
              /pci@0,0/pci-ide@12/ide@0/cmdk@0,0
           1. c5t0d0 
              /pci@0,0/pci1025,10a@13,2/storage@4/disk@0,0
    
  3. The live upgrade boot environment which was not active was deleted via ludelete(1M) and the slice was commented out in /etc/vfstab. This was needed to make zpool(1M) happy.
  4. A ZFS pool was created out of the slice on the internal disk (c0d0s7) and the external USB disk (c5t0d0). I had to force it because zpool(1M) complained about the overlap of c0d0s2 (the slice spanning the whole disk) and c0d0s7:
    # zpool create -f data mirror c0d0s7 c5t0d0
    
    For a while I struggled with finding a name for the pool (everybody seems either to stick to the 'tank' name or to come up with some double-cool-stylish name, which I wanted to avoid because of the likely degradation of the excitement from that name) but then chose the ordinary data (it's what it is, after all).
  5. I verified that it is possible to disconnect the USB disk and safely connect it again while an I/O operation is in progress:
    root:moose:/data# mkfile 10g /data/test &
    [1] 10933
    root:moose:/data# zpool status
      pool: data
     state: ONLINE
     scrub: none requested
    config:
    
    	NAME        STATE     READ WRITE CKSUM
    	data        ONLINE       0     0     0
    	  mirror    ONLINE       0     0     0
    	    c0d0s7  ONLINE       0     0     0
    	    c5t0d0  ONLINE       0     0     0
    
    errors: No known data errors
    
    It survived this without a hitch (okay, I had to wait a little longer for the zpool command to complete due to the still ongoing I/O, but that was it) and resynced the contents automatically after the USB disk was reconnected:
    root:moose:/data# zpool status
      pool: data
     state: DEGRADED
    status: One or more devices are faulted in response to persistent errors.
    	Sufficient replicas exist for the pool to continue functioning in a
    	degraded state.
    action: Replace the faulted device, or use 'zpool clear' to mark the device
    	repaired.
     scrub: resilver in progress for 0h0m, 3.22% done, 0h5m to go
    config:
    
    	NAME        STATE     READ WRITE CKSUM
    	data        DEGRADED     0     0     0
    	  mirror    DEGRADED     0     0     0
    	    c0d0s7  ONLINE       0     0     0
    	    c5t0d0  FAULTED      0     0     0  too many errors
    
    errors: No known data errors
    
    Also, with heavy I/O it is necessary to mark the zpool as clear after the resilver completes via zpool clear data, because the USB drive is marked as faulted. Normally this will not happen (unless the drive really fails) because I will be connecting and disconnecting the drive only when powering on or shutting down the laptop, respectively.
  6. After that I used Mark Shellenbaum's blog entry about ZFS delegated administration (it was Mark who did the integration) and the ZFS Delegated Administration chapter from the OpenSolaris ZFS Administration Guide and created a permission set for my local user and assigned those permissions to the ZFS pool 'data' and the user:
      # chmod A+user:vk:add_subdirectory:fd:allow /data
      # zfs allow -s @major_perms clone,create,destroy,mount,snapshot data
      # zfs allow -s @major_perms send,receive,share,rename,rollback,promote data
      # zfs allow -s @major_props copies,compression,quota,reservation data
      # zfs allow -s @major_props snapdir,sharenfs data
      # zfs allow vk @major_perms,@major_props data
    
    All of the commands had to be done under root.
  7. Now the user is able to create a home directory for himself:
      $ zfs create data/vk
    
  8. Time to set up the hierarchy of datasets and prepare it for the data. I have separated the datasets according to 'service level'. Some data are very important to me (e.g. presentations ;)) so I want them multiplied via the ditto blocks mechanism; with the copies dataset property set to 2 they are actually present 4 times on the mirror. Also, documents are not usually accompanied by executable code, so the exec property was set to off, which will prevent running scripts or programs from that dataset.
    Some data are volatile and in high quantity, so they do not need any additional protection and it is a good idea to compress them with a better compression algorithm to save some space. The following table summarizes the setup:
       dataset             properties                       comment
     +-------------------+--------------------------------+------------------------+
     | data/vk/Documents | copies=2                       | presentations          |
     | data/vk/Saved     | compression=on exec=off        | stuff from web         |
     | data/vk/DVDs      | compression=gzip-8 exec=off    | Nevada ISOs for LU     |
     | data/vk/CRs       | compression=on copies=2        | precious source code ! |
     +-------------------+--------------------------------+------------------------+
    
    So the commands will be:
      $ zfs create -o copies=2 data/vk/Documents
      $ zfs create -o compression=gzip-3 -o exec=off data/vk/Saved
      ...
    
  9. Now it is possible to migrate all the data, change the home directory of the user to /data/vk (e.g. via /usr/ucb/vipw) and relogin.

However, this is not the end of it, just the beginning. There are many things that could make the setup even better, to name a few:

  • setup some sort of automatic snapshots for selected datasets
    The set of scripts and SMF service for doing ZFS snapshots and backup (see ZFS Automatic For The People and related blog entries) made by Tim Foster could be used for this task.
  • make zpool scrub run periodically
  • detect disk failures
    Ideally this would be visible in the Gnome panel or Gnome system monitor.
  • set up off-site backup via SunSSH + zfs send (also sketched after this list)
    This could be done using the hooks provided by Tim's scripts (see above).
  • Set quotas and reservations for some of the datasets.
  • Install ZFS scripts for Gnome nautilus so I will be able to browse, take and destroy snapshots in nautilus. Now which set of scripts to use? Chris Gerhard's or Tim Foster's? Or should I just wait for official ZFS support to be integrated into nautilus?
  • Find out how exactly the recovery scenario (in case of laptop drive failure) will look.
    Importing the ZFS pool from the USB disk should suffice, but during my experiments I was not able to complete it successfully.

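As an illustration of the periodic scrub and off-site backup items above, here is a rough sketch (the cron schedule, host name and snapshot name below are made up for illustration):

# weekly scrub of the 'data' pool -- root's crontab entry, every Sunday at 3 a.m.
0 3 * * 0 /usr/sbin/zpool scrub data

# one-shot off-site copy of a snapshot over SSH to a hypothetical 'backuphost'
# (assuming the backup/vk hierarchy already exists in the remote pool)
$ zfs snapshot data/vk/Documents@offsite-2009-12-10
$ zfs send data/vk/Documents@offsite-2009-12-10 | \
      ssh backuphost zfs receive backup/vk/Documents
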
With all of the above, the data should be safe from disk failure (after all, disks are often called "spinning rust", so they are going to fail sooner or later) and even from the loss of both the laptop and the USB disk.

Lastly, a philosophical thought: one of my colleagues considers hardware a necessary (and very faulty) layer which is only needed to make it possible to express ideas in software. This might seem extreme, but come to think of it, ZFS is special in this sense: being software which provides that bridge, its core idea is to isolate hardware faults.

Thursday Jan 17, 2008

Adding dtrace SDT probes

It seems that many developers and dtrace users have found themselves in a position where they wanted to add some SDT probes to a module to get more insight into what's going on, but had to pause and think: "okay, more probes. But where to put them? Do I really need the additional probes when I already have the fbt ones?" To do this well, a systematic approach is needed in order not to overdo or underdo it. I will use KSSL (the Solaris kernel SSL proxy [1]) for illustration.

With CR 6556447, tens of SDT probes were introduced into the KSSL module and other modules which interface with it. In addition to the new SDT probes, we also got rid of the KSSL_DEBUG macros, which were compiled only into DEBUG kernels, and replaced them with SDT probes. As a result, much better observability and error detection was achieved with both debug and non-debug kernels. The other option would have been to create a dedicated KSSL dtrace provider, but that would be too big a gun for what needed to be achieved.
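
As a side note, one way to check which of these probes are available on a given system is to simply list them; a minimal sketch (assuming a kernel which ships the KSSL SDT probes):

# dtrace -l -n 'sdt:::kssl_mblk-*' -n 'sdt:::kssl_err-*'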

In general, the following interesting data points for gathering/observing data can be identified in code:

  • data paths
    When there is more than one path by which data can flow into a subsystem. E.g. for TCP there are a couple of ways SSL data can reach the KSSL input queue. To identify exactly where tcp_kssl_input() was called from, we use SDT probes:
    	if (tcp->tcp_listener || tcp->tcp_hard_binding) {
    ...
    		if (tcp->tcp_kssl_pending) {
    			DTRACE_PROBE1(kssl_mblk__ksslinput_pending,
    			    mblk_t *, mp);
    			tcp_kssl_input(tcp, mp);
    		} else {
    			tcp_rcv_enqueue(tcp, mp, seg_len);
    		}
    	} else {
    ...
    		/* Does this need SSL processing first? */
    			if ((tcp->tcp_kssl_ctx != NULL) &&
    			    (DB_TYPE(mp) == M_DATA)) {
    				DTRACE_PROBE1(kssl_mblk__ksslinput_data1,
    				    mblk_t *, mp);
    				tcp_kssl_input(tcp, mp);
    			} else {
    				putnext(tcp->tcp_rq, mp);
    				if (!canputnext(tcp->tcp_rq))
    					tcp->tcp_rwnd -= seg_len;
    			}
    ...
    
  • data processed in while/for cycles
    To observe what happens in each iteration of the cycle. This can be used in code like this:
    while (mp != NULL) {
    
      DTRACE_PROBE1(kssl_mblk__handle_record_cycle, mblk_t *, mp);
    
      /* process the data */
      ...
    
      mp = mp->b_cont;
    }
    
  • switch statements
    If significant/non-trivial processing happens inside a switch, it may be useful to add SDT probes there too. E.g.:
      content_type = (SSL3ContentType)mp->b_rptr[0];
      switch (content_type) {
        /* select processing according to type */
        case content_alert:
           DTRACE_PROBE1(kssl_mblk__content_alert, mblk_t *, mp);
           ...
           break;
        case content_change_cipher_spec:
           DTRACE_PROBE1(kssl_mblk__change_cipher_spec, mblk_t *, mp);
           ...
           break;
        default:
           DTRACE_PROBE1(kssl_mblk__unexpected_msg, mblk_t *, mp);
           break;
      }
    
  • labels which cannot be (easily) identified in another way
    Useful if the code which follows the label is generic (assignments, no function calls), e.g.:
                                    /*
                                     * Give this session a chance to fall back to
                                     * userland SSL
                                     */
                                    if (ctxmp == NULL)
                                            goto no_can_do;
    ...
    no_can_do:
                                    DTRACE_PROBE1(kssl_no_can_do, tcp_t *, tcp);
                                    listener = tcp->tcp_listener;
                                    ind_mp = tcp->tcp_conn.tcp_eager_conn_ind;
                                    ASSERT(ind_mp != NULL);
    

You have surely noticed that some of the probe definitions above share a common prefix (kssl_mblk-); error probes similarly share the kssl_err- prefix. This is one of the things which make SDT probes so attractive. With prefixes it is possible to do e.g. the following:

sdt:::kssl_err-*
{
  trace(timestamp);
  printf("hit error in %s\n", probefunc);
  stack(); ustack();
}

The important part is that we do not specify the module or function name. The implicit wildcard (module and function left out) combined with the explicit wildcard (prefix + asterisk) causes all KSSL error probes to be activated regardless of which module or function they are defined in. This is obviously very useful for technologies which span multiple Solaris subsystems or modules (such as KSSL).
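
For example, to see in which modules and functions the error probes actually fire, one could run a one-liner like this (just a sketch, aggregating firings by module and function):

# dtrace -n 'sdt:::kssl_err-* { @[probemod, probefunc] = count(); }'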

The nice thing about the error probes is that they can be leveraged in test suites. For each test case we can first start a dtrace script with the above probeset covering all KSSL errors in the background, and after the test completes just check whether it produced any data. If it did, the test case can be considered failed. No need to check kstat(1M) counters, log files, etc.
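
A minimal sketch of such a test wrapper might look like the following (the test command and output file name are made up for illustration):

#!/bin/ksh
# trace all KSSL error probes in the background while the test case runs
dtrace -q -n 'sdt:::kssl_err-* { printf("%s:%s\n", probefunc, probename); }' \
    -o /tmp/kssl_errors.$$ &
dtrace_pid=$!
sleep 2                         # give dtrace time to enable the probes

./run_one_testcase              # hypothetical test case

kill $dtrace_pid
wait $dtrace_pid 2>/dev/null
# a non-empty output file means some KSSL error probe fired => test failed
if [[ -s /tmp/kssl_errors.$$ ]]; then
        echo "FAIL: KSSL error probes fired"
        exit 1
fi
echo "PASS"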

Also, thanks to the way dtrace probes are activated, we can have both a generic probeset (using this term for lack of a better one) as above and a probe-specific action in the same script, e.g.:

/* probeset of all KSSL error probes */
sdt:::kssl_err-*
{
  trace(timestamp);
  printf("hit error in %s\n", probefunc);
}

/*
  the probe definition is:
     DTRACE_PROBE2(kssl_err__bad_record_size,
         uint16_t, rec_sz, int, spec->cipher_bsize);
 */
sdt:kssl:kssl_handle_record:kssl_err-bad_record_size
{
  trace(timestamp);
  printf("rec_sz = %d , cipher_bsize = %d\n", arg0, arg1);
}

If the kssl_err-bad_record_size probe gets activated, the generic probe fires too, because the probeset contains it.

Similarly to the error prefix, we can have a data-specific prefix. For KSSL it is the kssl_mblk- prefix, which can be used for tracing all mblks (msgb(9S)) as they flow through the TCP/IP, STREAMS and KSSL modules. With such probes it is possible to do e.g. the following:

/* how many bytes from a mblk to dump */
#define DUMP_SIZE       48

/* use macros from  */
#define MBLKL(mp)       ((mp)->b_wptr - (mp)->b_rptr)
#define DB_FLAGS(mp)    ((mp)->b_datap->db_flags)

/* not part of the original excerpt: assumed to print the read/write pointers */
#define PRINT_MBLK_PTRS(mp)                                             \
        printf("b_rptr = 0x%p, b_wptr = 0x%p\n",                        \
            ((mblk_t *)mp)->b_rptr, ((mblk_t *)mp)->b_wptr);

#define PRINT_MBLK_INFO(mp)                                             \
        printf("mblk = 0x%p\n", mp);                                    \
        printf("mblk size = %d\n", MBLKL((mblk_t *)mp));                \
        PRINT_MBLK_PTRS(mp);

#define PRINT_MBLK(mp)                                                  \
                trace(timestamp);                                       \
                printf("\n");                                           \
                PRINT_MBLK_INFO(mp);                                    \
                printf("DB_FLAGS = 0x%x", DB_FLAGS((mblk_t *)mp));      \
                tracemem(((mblk_t *)mp)->b_rptr, DUMP_SIZE);            \
                tracemem(((mblk_t *)mp)->b_wptr - DUMP_SIZE,            \
                        DUMP_SIZE);

sdt:::kssl_mblk-*
{
  trace(timestamp);
  printf("\n");
  PRINT_MBLK(arg0)
}

This is actually an excerpt from my (currently internal) KSSL debugging suite.
An example of output from such a probe can be seen in my Coloring dtrace output post.

For more complex projects it would be a waste to stop here; prefixes could be structured further. However, this has some drawbacks. In particular, I was thinking about having kssl_mblk- and kssl_err- prefixes. Now what to do for places where an error condition occurred _and_ we would like to see the associated mblk? Using something like kssl_mblk_err-* comes to one's mind. However, there is a problem with that: what about the singleton cases (only mblk, only err)? Sure, using multiple wildcards in dtrace is possible (e.g. syscall::*read*:), but that would get ugly and complicated, and it is probably safe to assume that the number of such mblk+err cases will be low. Simply, it is not worth the hassle; rather, I went with two separate probes in those places.
To conclude, using structured prefixes is highly beneficial only for sets of probes where the categories/sub-prefixes form non-intersecting sets (e.g. data type and debug level).

Of course, all of the above applies not only to the kernel but also to custom userland probes!

[1] High-level description of KSSL can be found in blueprint 819-5782.
