Wednesday May 31, 2006

Solaris kernel is pre-emptible by default

It is a fact that the Solaris kernel can be pre-empted at any point in time. To verify this, we fortunately had a system call that causes the kernel to loop in while(1) forever, and we had a machine with 2 CPUs. We made that system call from an application, which caused CPU 1 to loop forever in the kernel. We then started a while(1) loop in user land, gave the process \*real time\* priority, and asked the kernel to bind it to CPU 1, on which the kernel was already looping in while(1). The moment we did this, the system hung. The cause of the hang was that the while(1) inside the kernel got \*pre-empted\* and was switched to CPU 2, while CPU 1 started running the \*user land while(1) with real time priority\*. This left both CPUs looping in while(1).

This makes it clear that the Solaris kernel is pre-emptible across CPUs by default. The next question was: while the kernel is looping on a CPU, who forces it to switch to a different CPU? It turned out that a check for pre-emption is made when returning from an \*interrupt\*. If a thread with a higher priority than the current thread is runnable on the CPU, the OS forces the current thread to switch to a different CPU. That is how the kernel loop got moved to the other CPU.
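The check described above can be replayed with a toy model. The sketch below is not Solaris code: the `Thread` and `Cpu` classes, the function name `return_from_interrupt` and the priority numbers are all made up for illustration, and it assumes that a larger number means a more urgent thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Thread:
    name: str
    priority: int                    # assumption: larger number = more urgent
    bound_cpu: Optional[int] = None  # None = may run on any CPU


@dataclass
class Cpu:
    cpu_id: int
    current: Optional[Thread] = None
    runnable: List[Thread] = field(default_factory=list)


def return_from_interrupt(cpu: Cpu, all_cpus: List[Cpu]) -> None:
    """Toy version of the pre-emption check made when an interrupt returns."""
    if not cpu.runnable:
        return
    best = max(cpu.runnable, key=lambda t: t.priority)
    if cpu.current is not None and best.priority <= cpu.current.priority:
        return  # nothing more urgent is waiting on this CPU
    preempted = cpu.current
    cpu.runnable.remove(best)
    cpu.current = best
    if preempted is None:
        return
    # The pre-empted thread moves to an idle CPU if it is not bound here.
    if preempted.bound_cpu is None:
        for other in all_cpus:
            if other is not cpu and other.current is None:
                other.current = preempted
                return
    cpu.runnable.append(preempted)


# Replay of the experiment: the kernel loop runs on CPU 1, CPU 2 is idle.
cpu1 = Cpu(1, current=Thread("kernel while(1)", priority=60))
cpu2 = Cpu(2)
cpu1.runnable.append(Thread("user while(1), RT", priority=100, bound_cpu=1))
return_from_interrupt(cpu1, [cpu1, cpu2])
print(cpu1.current.name, "|", cpu2.current.name)
```

After the call, CPU 1 runs the real time user loop and CPU 2 runs the kernel loop, so no CPU is left for anything else. That matches the hang seen in the experiment.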

To summarise -

1. The Solaris kernel can be pre-empted if some higher priority thread (RT) is runnable on the same CPU.
2. To avoid this, we have to explicitly ask the kernel not to pre-empt us while we are executing critical code.
3. If we are holding a spin lock in the kernel, we won't get pre-empted.

A small question that comes to my mind here is: what would have happened if there was only one CPU? I'm not sure that Solaris would come back to user land once a while(1) loop is started in the kernel. It was only because there were 2 CPUs on the system that the kernel while(1) got switched to another CPU. To verify this, we can run a shell with real time priority on a single CPU machine. From this shell we can start the same application, which invokes the system call that starts the while(1) loop within the kernel. If we get back to the shell prompt, that confirms that the kernel was pre-empted in favour of a higher priority process. If that is true, the kernel is completely pre-emptible in the following situation -
- Kernel code is executing without any lock held
Otherwise we can assume that there are pre-emption points in the Solaris kernel (when releasing locks etc.) where the check is made to yield the CPU.

It may happen that the kernel is pre-empted while holding a lock. If a kernel control path is holding multiple locks, it may be forced to yield the CPU when it releases the very first lock, with the other locks still held. To avoid this, we may explicitly set flags telling the kernel not to yield.

Linux 2.4, on the other hand, is non-preemptible by default. A kernel control path has to yield explicitly to give up the CPU, or it gives up the CPU when it goes to sleep on some wait queue. Linux 2.6, in contrast, can be configured to be preemptible. While returning from an interrupt/exception, it checks whether scheduling is required to yield the CPU to a higher priority task, even if it was the kernel that was interrupted. If the kernel code was holding a spin lock, was inside the scheduler, or was handling soft IRQs, it is not pre-empted; otherwise it can be. While releasing a spin lock, a check is also made whether pre-emption is required, i.e. whether any higher priority thread is runnable on the CPU.
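The Linux 2.6 rule described above can be written down as a small decision function. This is only a model of the rule as stated in the text, not kernel code; the `KernelPath` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelPath:
    """Hypothetical snapshot of an interrupted kernel control path."""
    spinlocks_held: int = 0
    in_softirq: bool = False
    in_scheduler: bool = False
    need_resched: bool = False   # a higher priority task is runnable


def may_preempt(path: KernelPath) -> bool:
    """Decide whether the kernel path may be pre-empted right now."""
    if not path.need_resched:
        return False
    blocked = (path.spinlocks_held > 0
               or path.in_softirq
               or path.in_scheduler)
    return not blocked


def spin_unlock(path: KernelPath) -> bool:
    """Release one spin lock, then repeat the pre-emption check."""
    path.spinlocks_held -= 1
    return may_preempt(path)


path = KernelPath(spinlocks_held=2, need_resched=True)
print(may_preempt(path))   # still holding two locks
print(spin_unlock(path))   # one lock left
print(spin_unlock(path))   # last lock released
```

In this model the path holding two locks becomes pre-emptible only when the last lock is released, so the three calls print `False`, `False`, `True`.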

Normally (without kernel pre-emption), Linux, while returning from an interrupt/exception, just checks whether the CPU was running in kernel mode before the event (from the CS register). If so, it simply resumes the kernel from where it left off, even if a timer interrupt has occurred and the time slice of the current process has elapsed.
Attempts have been made to make the MontaVista Linux kernel fully pre-emptible.

Beware: Solaris kernel programmers have to be careful while designing their code, because they must know the points where the kernel should not be pre-empted. Linux kernel programmers, in contrast, can be less careful as far as pre-emption is concerned, because they know that only they themselves can give up the CPU in the kernel. In the 2.6 Linux kernel, pre-emption points are designed such that pre-emption happens only after critical code has executed.

Wednesday May 10, 2006

TCP/IP overview on Solaris


This write-up gives an overview of the TCP/IP stack implementation on Solaris. The discussion starts with some STREAMS concepts such as queues and messages, and then explains how TCP/IP is built as modules that fit into the STREAMS framework. After that we see how packets travel up and down the stream as messages. There is a short discussion of the SYNC queue (part of the STREAMS framework) and how it is used to process TCP/IP messages asynchronously, followed by a short discussion of the service routines that process the messages held on a queue's message queue when the queue is not able to pass a message on to the next module. There is also a short discussion of the squeue processing related to the Fire-engine design. Finally we see how normal data and TCP urgent data (OOB data) are processed at the TCP, stream head and sockfs levels. The write-up is based entirely on knowledge gained from experience, code browsing and reading some documents. I'd say it gives a fair idea of the TCP/IP implementation on Solaris, though there may be errors in my understanding of the subject. There is plenty of scope for modification, and comments are always welcome. For a better understanding of STREAMS concepts, please refer to the STREAMS Programming Guide. For a better understanding of TCP/IP concepts, please refer to TCP/IP Illustrated, Vol. 1 by W. Richard Stevens.
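To make the role of the service routine concrete, here is a toy model of two stacked queues. It is a sketch under simplifying assumptions, not the STREAMS implementation: the `Queue` class, the `hiwat` limit counted in messages, and the method names are invented for illustration. In real STREAMS the corresponding work is done by routines such as canputnext(), putq() and putnext().

```python
from collections import deque


class Queue:
    """Toy queue: forwards messages, or holds them if the next queue is full."""

    def __init__(self, name, hiwat, next_q=None):
        self.name = name
        self.hiwat = hiwat        # assumption: limit counted in messages
        self.next_q = next_q
        self.messages = deque()

    def is_full(self):
        return len(self.messages) >= self.hiwat

    def put(self, msg):
        """Put procedure: pass the message on at once when possible."""
        # The last queue (no next_q) simply holds data until it is read.
        # A queue that already holds messages keeps ordering by queueing.
        if self.next_q is None or self.messages or self.next_q.is_full():
            self.messages.append(msg)
        else:
            self.next_q.put(msg)

    def service(self):
        """Service routine: drain held messages once the next queue has room."""
        while (self.next_q is not None and self.messages
               and not self.next_q.is_full()):
            self.next_q.put(self.messages.popleft())

    def read(self):
        return self.messages.popleft() if self.messages else None


head = Queue("stream head", hiwat=2)
tcp = Queue("tcp", hiwat=4, next_q=head)
for m in ("m1", "m2", "m3"):
    tcp.put(m)              # m3 is held on tcp because the head is full
head.read()                 # the application reads m1
tcp.service()               # the service routine now passes m3 upstream
print(list(head.messages), list(tcp.messages))
```

In this run the third message waits on the tcp queue until the reader frees space, after which the service routine delivers it in order.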

Please find the complete writeup here...

Important kernel data structures related to TCP/IP and the kernel streams/squeues, and how they are linked, can be found here...

Some of the commands to debug TCP/IP on Solaris using the 'scat' crash analyser
On Solaris 10: scat is a debugger used to analyse live kernel memory or a crash dump. I'd like to introduce some of the scat commands that can be used to debug streams and the TCP/IP stack.
scat> stream -s
The output of this command lists all the streams currently active on the system. Each entry has the stream pointer, the modules stacked on the stream, the messages in the queue/syncq, and QFULL information.
scat> stream findproc [stream address]
From the 'stream -s' output we can get the stream address. We can use the stream address to find the process to which the stream belongs. The output contains the process name, the file descriptor associated with the stream, and the vnode address associated with the stream.
scat> sdump [stream vnode address] sonode
From the vnode address associated with the stream (from the stream findproc output), we can get to the sonode structure. The sonode contains all the information about the socket.
scat> stream -l [stream address]
We can use the stream address to find out the details of the stream. This command gives detailed information about all the modules stacked on the stream, the details of the queue_t structure for each module (and for the stream head), a complete list of the messages queued on each queue, and the state of each module's queue and of the stream head. The q_ptr field of the queue_t structure points to the private data of the module. For the TCP module, q->q_ptr points to the conn_s structure. conn_s is the connection structure for TCP and contains all the connection specific information. The conn_tcp field of the conn_s structure points to the tcp_t structure for the TCP connection. For all of this we need the address of TCP's queue_t, which we can get from the stream -l output. So, we can use the sdump command of scat to examine the queue_t, conn_s and tcp_t structures in the following way -

for queue_t structure

scat> sdump [TCP queue address] queue_t

for TCP conn_s structure

scat> sdump [TCP queue address] queue_t q_ptr
This gives us the address of the conn_s structure for the TCP connection.

scat> sdump [conn_s address] conn_s
Here we get the dump of the conn_s structure for the TCP connection.

for TCP tcp_t structure

scat> sdump [conn_s address] conn_s conn_tcp
This gives us the address of the tcp_t structure for the TCP connection.

scat> sdump [tcp_t address] tcp_t
Here we get the dump of the tcp_t structure for the TCP connection.
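The pointer chain that these sdump commands follow (queue_t -> q_ptr -> conn_s -> conn_tcp -> tcp_t) can be pictured with a small model. The classes below only mirror the two fields named in the text; the `state` field and the helper `tcp_from_queue` are hypothetical and stand in for the many real fields.

```python
from dataclasses import dataclass


@dataclass
class TcpT:
    """Stands in for tcp_t; 'state' is a made-up example field."""
    state: str


@dataclass
class ConnS:
    """Stands in for conn_s; conn_tcp points to the tcp_t."""
    conn_tcp: TcpT


@dataclass
class QueueT:
    """Stands in for queue_t; q_ptr is the module's private data."""
    q_ptr: object


def tcp_from_queue(q: QueueT) -> TcpT:
    """Follow q_ptr to the conn_s, then conn_tcp to the tcp_t."""
    conn = q.q_ptr
    return conn.conn_tcp


tcp = TcpT(state="ESTABLISHED")
tcp_queue = QueueT(q_ptr=ConnS(conn_tcp=tcp))
print(tcp_from_queue(tcp_queue).state)
```

Each sdump step above corresponds to one attribute access in `tcp_from_queue`.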

Note: throughout the write-up, queue maps to queue_t, message block maps to mblk_t, and data block maps to dblk_t. I have not used the actual field names of the data structures, only names that denote the fields. For example, readp & writep are mentioned for the message block; these correspond to the read pointer and write pointer of the mblk_t. So, is the best place to browse the source code.


Monday Dec 05, 2005

Boot-net With BOOTPARAMS/RARP in Solaris

Introduction: “boot net” is the term used for booting over the network. Another term for this process is jumpstart. Just as it can boot from disk, floppy or CD-ROM, Solaris provides an environment to boot over the net. Solaris can boot from the network by issuing the traditional boot net command at the OK prompt. This way of booting uses the BOOTPARAMS/RARP protocols and has the limitation that the complete boot setup must be within the client's subnet. Boot net with the DHCP option is another way of booting over the network. With the dhcp option, there is no constraint that the boot setup be within the client's subnet. The restriction comes from the fact that the BOOTPARAMS/RARP implementation on Solaris has no provision for exchanging netmask information, whereas DHCP provides netmask information to the client. A much more advanced net-booting architecture was introduced with Solaris 9. It is called wanboot and can be used to boot over the WAN. Among its many features, security is the most attractive one. Even though net booting has advanced a lot, there are still old setups that use the traditional net-booting method. In this write-up we focus only on the traditional boot net process. The topics covered are -

- configuring client at the boot server
- boot net process
- First stage of booting with inetboot
- second stage of booting with genunix
- Limitation with the existing design of inetboot.

The idea here is not to dwell on the kernel booting process but to learn more about the network specific activities related to boot net. When booting from disk there are several stages of booting, and at each stage we have a pointer to the next boot image on the disk. The boot loader loads the next boot image from the disk to a specific location and jumps to that location to pass control to the new boot image. Finally the Unix kernel is loaded, which bootstraps itself and brings up the system. While bootstrapping, the kernel loads various modules/drivers and configuration files into memory from the disk. So, in the case of disk booting, all the boot images, the kernel and its prerequisite components are loaded from the disk.

In a very similar fashion, boot net has several stages of booting. At each stage we load a different image into memory from the boot server over NFS. The additional work the booting client has to do at each stage is to discover its identity, configure the network interface properties, find the boot server, find the root file handle and mount the root file system over NFS. Finally it starts loading the boot image/kernel image/required drivers/modules and configuration files into memory from the NFS server from which the miniroot is mounted. So essentially nothing changes when it comes to kernel booting: in the case of disk booting all requests are served from the disk, whereas in the case of boot net they are served from the NFS server.

We will therefore concentrate mainly on the network related activity between client and server when the client is booting over the network. The sequence and fundamentals of kernel bootstrapping are out of the scope of this write-up. The write-up is based purely on experience gained while solving issues related to boot net and on studying the code in general. There is plenty of scope for modifying it, by way of expansion or correction.
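The client-side steps named above (discover identity, find the boot server's parameters, locate and mount the root file system) can be laid out as a sequence. The sketch below is a simplified simulation, not the inetboot code: `BootServer`, `boot_net`, the table layout and the sample addresses are all hypothetical. The `whoami` and `getfile` method names echo the two procedures of the bootparams protocol.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BootServer:
    """Hypothetical stand-in for the boot server's client configuration."""
    ethers: Dict[str, str]                 # MAC address -> IP address
    bootparams: Dict[str, Dict[str, str]]  # IP address -> client entry

    def rarp(self, mac: str) -> Optional[str]:
        return self.ethers.get(mac)

    def whoami(self, ip: str) -> Optional[str]:
        entry = self.bootparams.get(ip)
        return entry["hostname"] if entry else None

    def getfile(self, ip: str, key: str) -> Optional[str]:
        entry = self.bootparams.get(ip)
        return entry.get(key) if entry else None


def boot_net(mac: str, server: BootServer) -> List[str]:
    """Walk through the network steps and return them in order."""
    steps = []
    ip = server.rarp(mac)                  # discover identity
    if ip is None:
        raise LookupError("no RARP reply for " + mac)
    steps.append("configure interface with " + ip)
    hostname = server.whoami(ip)           # who am I?
    if hostname is None:
        raise LookupError("no bootparams entry for " + ip)
    steps.append("hostname is " + hostname)
    root = server.getfile(ip, "root")      # where is my root?
    if root is None:
        raise LookupError("no root entry for " + hostname)
    steps.append("mount " + root + " over NFS")
    steps.append("load kernel, modules and configuration from " + root)
    return steps


server = BootServer(
    ethers={"8:0:20:aa:bb:cc": "192.0.2.10"},
    bootparams={"192.0.2.10": {"hostname": "client1",
                               "root": "bootsrv:/export/root/client1"}},
)
for step in boot_net("8:0:20:aa:bb:cc", server):
    print(step)
```

A client whose MAC address is not configured at the server fails at the first step, which is a common symptom when the client entry is missing.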

Please find the complete writeup here...


Sunday Jul 10, 2005

nfslogd & ndbm design limitations

nfslogd is a daemon that runs on the NFS server to log all NFS activity from the clients. Part of this activity is storing information about files/links/directories. nfslogd does not use flat files for this, as searching the data would become very inefficient. Instead, nfslogd uses the native Solaris database ndbm to store these records, which makes searching/deleting/inserting records very efficient. nfslogd stores two sets of records for each file/link/directory: primary and secondary. This write-up is aimed at exposing the design limitations of ndbm. These limitations ultimately surface in the user of ndbm, i.e. nfslogd. As a result, nfslogd allows only a limited number of files/links to be created in a directory before it starts throwing dbm errors. The write-up covers the following topics -
- nfslogd key & records
- nfslogd interfacing with ndbm database
- ndbm data-structure & interfaces
- ndbm database data organisation
- ndbm insert/fetch/delete design
- positioning a new record in the database & the split mechanism
- limitations of the database and hence of nfslogd
Please find the complete write-up with the analysis here...
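To give a feel for how a split mechanism can run into a hard limit, here is a toy hashed store with fixed-size pages. It is not the ndbm code, and whether it matches the exact limitation analysed in the write-up is an assumption. `ToyDbm`, `PAGE_CAPACITY` and `MAX_HASH_BITS` are invented names and values. A full page is split using one more bit of the hash; when keys cannot be separated by any further bit, the store fails.

```python
PAGE_CAPACITY = 4   # toy value: records per page
MAX_HASH_BITS = 8   # toy value: hash bits available for choosing a page


class DbmError(Exception):
    pass


class ToyDbm:
    def __init__(self, hash_fn):
        self.hash_fn = hash_fn
        # (bits used, bucket number) -> keys stored on that page
        self.pages = {(0, 0): []}

    def _locate(self, h):
        """Find the one existing page that covers hash value h."""
        for bits in range(MAX_HASH_BITS + 1):
            page_id = (bits, h & ((1 << bits) - 1))
            if page_id in self.pages:
                return page_id
        raise AssertionError("no page covers this hash")

    def store(self, key):
        h = self.hash_fn(key)
        while True:
            bits, bucket = self._locate(h)
            page = self.pages[(bits, bucket)]
            if len(page) < PAGE_CAPACITY:
                page.append(key)
                return
            if bits == MAX_HASH_BITS:
                raise DbmError("page full and cannot be split further")
            # Split the page on the next hash bit and retry.
            del self.pages[(bits, bucket)]
            low, high = [], []
            for k in page:
                target = high if (self.hash_fn(k) >> bits) & 1 else low
                target.append(k)
            self.pages[(bits + 1, bucket)] = low
            self.pages[(bits + 1, bucket | (1 << bits))] = high


good = ToyDbm(hash_fn=lambda k: k)      # keys spread over pages
for i in range(100):
    good.store(i)

bad = ToyDbm(hash_fn=lambda k: 0)       # every key lands on the same page
try:
    for i in range(PAGE_CAPACITY + 1):
        bad.store(i)
except DbmError as err:
    print("dbm error:", err)
```

With a well-spread hash, all 100 keys fit. When every key hashes alike, splitting never helps and the fifth store raises the error, however empty the rest of the database is.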


Tuesday Jul 05, 2005

NFS (Client) I/O Architecture & Implementation In Solaris

This write-up details the architecture of the Solaris I/O implementation for the NFSv3 client. The document helps in gaining insight into the complete life-cycle of an NFS data transaction between client and server. The life-cycle involves several steps of NFS data processing by the kernel. These steps within the kernel are:
- receiving the NFS read/write request from the user application,
- processing the data using various kernel frameworks,
- issuing the RPC request to the NFS server,
- (NFS client) receiving the response from the NFS server &
- processing the data and finally returning to the user application which initiated the request.

Both the SYNC & ASYNC frameworks for NFS client data transactions are elaborated. The idea is not to walk through the entire code base but to get familiar with the design and implementation of the NFS client's read/write process in the multi-threaded kernel environment. The various kernel data structures involved in the client's NFS read/write are covered. The kernel frameworks used by the NFS client, such as the kernel VFS, paging, VM, NFS etc., are touched upon, though not in depth. The write-up contains some examples that explain the NFS client's read/write behaviour in different situations. The client's data and attribute caching is covered, with a brief explanation of how close-to-open consistency is implemented.
This is helpful in understanding & tackling read/write/caching and related issues associated with the NFS client. Beyond that, the document serves as a roadmap for the NFS v3 read/write process at the client end. Last but not least, the write-up doesn't cover each and every detail of the NFS client related to the subject, so there is plenty of scope for anybody interested to add more to it. The NFS v4 read/write architecture is not very different from v3 except for the delegation feature (serialization of read & write) & compound RPC calls, and these have not changed the NFS read/write architecture and design. A comparative study of NFS v3/v4 can be the next step to confirm this.
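The caching consistency mentioned above is usually called close-to-open consistency: dirty data is flushed to the server on close, and cached data is revalidated against the server's attributes on open. The sketch below is a heavily simplified model of that idea, not the Solaris implementation. `Server` and `NfsClientFile` are hypothetical classes, a whole file is treated as one cached unit, and a counter stands in for the modification time.

```python
class Server:
    """Hypothetical server holding one file and its modification counter."""

    def __init__(self):
        self.data = b""
        self.mtime = 0

    def getattr(self):
        return self.mtime

    def read(self):
        return self.data

    def write(self, data):
        self.data = data
        self.mtime += 1


class NfsClientFile:
    """Toy client-side view of the file with data and attribute caching."""

    def __init__(self, server):
        self.server = server
        self.cached_data = None
        self.cached_mtime = None
        self.dirty = None

    def open(self):
        # Revalidate on open: drop cached data if the file changed.
        mtime = self.server.getattr()
        if mtime != self.cached_mtime:
            self.cached_data = None
            self.cached_mtime = mtime

    def read(self):
        if self.dirty is not None:
            return self.dirty
        if self.cached_data is None:
            self.cached_data = self.server.read()
        return self.cached_data

    def write(self, data):
        self.dirty = data            # kept locally until close

    def close(self):
        # Flush on close so the next opener sees the data.
        if self.dirty is not None:
            self.server.write(self.dirty)
            self.cached_data = self.dirty
            self.cached_mtime = self.server.getattr()
            self.dirty = None


server = Server()
writer, reader = NfsClientFile(server), NfsClientFile(server)
reader.open()
reader.read()                 # caches the empty file
writer.open()
writer.write(b"v1")
writer.close()                # data reaches the server here
stale = reader.read()         # no re-open yet: served from the cache
reader.close()
reader.open()                 # attributes changed: cache is dropped
fresh = reader.read()
print(stale, fresh)
```

The reader sees the new data only after it re-opens the file, which is exactly the guarantee this consistency model gives and no more.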

Note: Some more material has been added to the original write-up to make it more complete. The additions relate to
- NFS dirty pages
- the linking of various kernel & NFS data structures, with the help of a schematic diagram.

The complete write-up can be downloaded from

- PDF format: nfs architecture.



