Wednesday Jul 08, 2015

Post-Wait mechanism

We have added post-wait, a lightweight synchronization mechanism for use between threads of different processes or of the same process. The post and wait operations are performed on a key.

There can be scenarios where threads of cooperating processes in a multi-process, multi-threaded application have to synchronize. The post-wait facility can be used to do so by having a thread in one process wait on a key and be notified by a post to that key from a thread in another process. The key serves as a channel or endpoint for the post and wait operations. Posting to a key instead of to a specific thread gives the flexibility of having any thread in a process wait on the key and be notified. This can be useful in a thread-pool-based implementation. In addition, multiple threads can wait on the same key. A post on this key results in the notification being broadcast to all the waiting threads. With the exclusive flag, the application can ensure that only one thread waits on a key if required. See the port_associate(3C) man page for documentation about the flags.

The post-wait API can also pass payload data with the post notification. This is useful for passing additional data along with the notification, and avoids implementing a separate mechanism for exchanging data.

Event ports must be used to receive post notifications. Event ports are a common event notification framework introduced in Solaris 10. Waiting for post notifications on an event port consolidates them with notifications from other event sources (such as file descriptor poll events, async I/O, and file events), so a thread can wait in one place for notifications from multiple event sources.

To receive a post notification, a key is generated using the postwait_genkey(3C) API and associated with an event port. This key is exchanged, using any existing IPC mechanism, with the other processes that will post to it. When a post is made, the post notification event is sent to the event port with which the key is associated. The receiving thread collects the post notification from the event port using the port_get(3C) or port_getn(3C) API.

A vector post API, postwait_postn(3C), is included, which helps when posting to multiple keys with one call.

The post-wait feature is available starting with Solaris 11.2.

Refer to the man pages mentioned above for more details about the API.



Here is a simple example:

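/* waiter.c --- cc -m64 -o waiter waiter.c */
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <port.h>

int
main(int ac, char *av[])
{
        int portfd;
        struct port_event pe;
        postwk_obj_t    pko;
        postwkey_t      pkey;

        portfd = port_create();
        if (portfd == -1) {
                perror("port_create");
                exit(1);
        }

        /*
         * Generate a key. This call assumes postwait_genkey() takes a
         * pointer to the key; check postwait_genkey(3C) for the exact
         * signature.
         */
        if (postwait_genkey(&pkey) == -1) {
                perror("postwait_genkey");
                exit(1);
        }

        /* Associate the key with the event port. */
        pko.ko_flags = POSTWKEY_QUEUING;
        pko.ko_key = pkey;
        if (port_associate(portfd, PORT_SOURCE_POSTWAIT, (uintptr_t)&pko, 0, 0)
                 == -1) {
                perror("port_associate");
                exit(1);
        }

        /* Print the key so that it can be handed to the poster. */
        printf("Key %llu\n", (unsigned long long)pkey);

        while (1) {
                /*
                 * wait for post.
                 */
                if (port_get(portfd, &pe, NULL)  == -1) {
                        perror("port_get");
                        exit(1);
                }

                switch(pe.portev_source) {
                case PORT_SOURCE_POSTWAIT:
                        /* The poster sends its pid as event, thread id as data. */
                        printf("Received post event from pid %d thread %d\n",
                            pe.portev_events, (int)(uintptr_t)pe.portev_user);
                        break;
                /* cases for other event source type can be added here */
                default:
                        printf("unexpected source %d\n", pe.portev_source);
                        break;
                }
        }
        return (0);
}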

/* poster.c --- cc -m64 -o poster poster.c */
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <pthread.h>
#include <port.h>

int
main(int ac, char *av[])
{
        postwkey_t      pc;
        pid_t           pid;
        pthread_t       tid;

        if (ac < 2) {
                printf("Usage: %s <pwkey>\n", av[0]);
                exit(1);
        }

        /* The key printed by the waiter, passed on the command line. */
        pc = atoll(av[1]);

        /*
         * Post notification - with pid as event and thread id as data.
         */
        pid = getpid();
        tid = pthread_self();
        printf("My pid %d tid %d\n", (int)pid, (int)tid);
        if (postwait_post(pc, pid, tid) == -1) {
                perror("postwait_post");
                exit(1);
        }
        return (0);
}

Run these programs in separate terminal windows.

$ ./waiter
Key 8393086987104875006
Received post event from pid 25722 thread 1

$ ./poster 8393086987104875006
My pid 25722 tid 1

Monday Aug 27, 2007

File Events Notification

Recently, I added the File Events Notification (FEN) facility (PSARC 2007/027) to Solaris. It can be used to monitor files and directories for changes. FEN is implemented as an event source under the Event Ports framework. This facility is available starting in Nevada build 72.

Many applications need to monitor files and directories for status changes caused by non-communicating processes. For example, a file manager monitors a directory (folder) for changes so that it can update the list of files it displays when files are created or deleted. Some applications may need to monitor config files and reload them when they change. The current method is to stat the file or directory periodically to find out whether it has changed.
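The periodic stat approach can be sketched roughly as follows. This is only an illustration, not code from any particular application: the names poll_watch, poll_watch_init and poll_watch_changed are made up for this example, and only the modification time is compared.

```c
#define _POSIX_C_SOURCE 200809L /* for st_mtim in struct stat */
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: remembers the last modification time seen for a path. */
struct poll_watch {
        const char *path;
        struct timespec last_mtime;
};

/* Record the current modification time. Returns 0 on success, -1 on error. */
int
poll_watch_init(struct poll_watch *w, const char *path)
{
        struct stat sb;

        if (stat(path, &sb) == -1)
                return (-1);
        w->path = path;
        w->last_mtime = sb.st_mtim;
        return (0);
}

/*
 * Stat the file again and report whether its modification time differs
 * from the one recorded earlier. The application has to call this
 * periodically, which is the cost that FEN is meant to avoid.
 */
bool
poll_watch_changed(struct poll_watch *w)
{
        struct stat sb;

        if (stat(w->path, &sb) == -1)
                return (true);  /* treat a vanished file as a change */
        if (sb.st_mtim.tv_sec == w->last_mtime.tv_sec &&
            sb.st_mtim.tv_nsec == w->last_mtime.tv_nsec)
                return (false);
        w->last_mtime = sb.st_mtim;
        return (true);
}
```

An application would call poll_watch_init() once per file and then poll_watch_changed() from a timer, so a change is only noticed at the next polling interval.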

The FEN facility provides a mechanism to monitor files and directories for status changes and to receive an event notification when a change occurs. It serves as an appropriate and efficient alternative to the current method, where applications have to stat files and directories periodically.

There are different file system change notification mechanisms available on other operating systems. Some of them implement event queuing in the kernel and provide additional context, like file names, along with the change event. For example, when a directory is monitored, one event is delivered for each file that gets created in the directory. The events are queued in the kernel while the application processes each of them. This exposes scalability issues.

Obviously, the rate at which events can be generated and delivered can be much higher than the rate at which the application receives and processes them. The events have to be queued up, locking down kernel memory until the application processes them. A limit has to be imposed on the number of events that can be queued, which means events can be missed due to overflow. The application then has to implement a fallback method to handle the overflow condition. The scalability issues have been discussed on the perf-discuss forum.
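The overflow problem with queued events can be illustrated with a toy bounded queue. This is a user-level model written for this post, not the implementation used by any operating system; the capacity of 4 and all the names (event_queue, evq_push, evq_pop) are arbitrary choices for the example.

```c
#include <stddef.h>

#define EVQ_CAPACITY 4  /* hypothetical limit on the number of queued events */

struct event_queue {
        int events[EVQ_CAPACITY];
        size_t head;            /* index of the oldest queued event */
        size_t count;           /* number of events currently queued */
        unsigned long dropped;  /* events lost because the queue was full */
};

void
evq_init(struct event_queue *q)
{
        q->head = 0;
        q->count = 0;
        q->dropped = 0;
}

/* Producer side: returns 0 if the event was queued, -1 if it was dropped. */
int
evq_push(struct event_queue *q, int ev)
{
        if (q->count == EVQ_CAPACITY) {
                q->dropped++;
                return (-1);
        }
        q->events[(q->head + q->count) % EVQ_CAPACITY] = ev;
        q->count++;
        return (0);
}

/* Consumer side: returns 0 and stores the oldest event, or -1 if empty. */
int
evq_pop(struct event_queue *q, int *ev)
{
        if (q->count == 0)
                return (-1);
        *ev = q->events[q->head];
        q->head = (q->head + 1) % EVQ_CAPACITY;
        q->count--;
        return (0);
}
```

When the producer pushes faster than the consumer pops, the dropped counter grows, and the consumer has no way of knowing which changes it missed without rescanning.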

In the approach we have taken with FEN, file change event notification is provided in accordance with what an application can find out by stat'ing a file and comparing time stamps. There is no queuing of events. Once an event is delivered, the file/directory watch is automatically deactivated (deregistered). The application has to re-register the file to resume watching. This behavior is intentional: it aids multi-threaded programming with the file events notification API, and it helps filter out redundant events. High-resolution time stamps are used to ensure that no change events are missed between the time a file is processed and the time it is registered to be monitored.

For example, a multi-threaded program could create a pool of worker threads to process file events. If the file events watch were not deactivated after event delivery and continued to deliver events, more than one thread could collect events for the same file and attempt to process it. The application would then have to implement a synchronization mechanism, complicating the usage of this facility.

Whereas, if the file events watch is deregistered after an event is delivered, one thread collects the event and processes the file without the possibility of another thread receiving an event and processing the same file. Once processing is finished, the file can be re-registered to activate the file events watch on it. The time stamps passed in at the time of re-registering are used to determine whether the file has changed while it was being processed.
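The one-shot behavior can be modeled in a few lines. This is a conceptual sketch using C11 atomics, not the kernel implementation; oneshot_watch, oneshot_register and oneshot_deliver are names invented for the illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a watch that disarms itself on delivery. */
struct oneshot_watch {
        atomic_bool armed;
};

/* (Re-)register the watch, as the application does after processing. */
void
oneshot_register(struct oneshot_watch *w)
{
        atomic_store(&w->armed, true);
}

/*
 * Try to deliver an event. The atomic exchange disarms the watch and
 * returns its previous state, so even if several threads race here,
 * only one of them sees true and goes on to process the file.
 */
bool
oneshot_deliver(struct oneshot_watch *w)
{
        return (atomic_exchange(&w->armed, false));
}
```

Any further change is ignored until oneshot_register() is called again, which corresponds to the re-registration step described above.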

The FEN facility is implemented as an event source under the Event Ports framework. Hence the FEN API will be an extension to the Event Ports API.

Event Ports :

Event Ports is a unified event notification framework that was introduced in Solaris 10. It provides an API through which an application can collect event notifications from different subsystems (event sources). The current set of subsystems that provide event notification under the Event Ports framework is:

  • Async I/O
  • POSIX timers
  • User events
  • POSIX message queues
  • Polling file descriptors

And now:

  • File Events Notification (FEN)

With the Event Ports architecture, there is an emphasis on making multi-threaded programming with the API simpler. The implementations of the event sources under the Event Ports framework attempt to support this. Event notification is inherently asynchronous, since different function calls are used to register with an event source (port_associate() or an event-source-specific call) and to collect events (port_get()/port_getn()), which makes usage of the API flexible.

The Event Ports framework is suitable for an event-driven programming model. The event loop waits on the event port, where events from different subsystems (event sources) are collected, and calls the relevant event handler routines to process the events. A pool of worker threads can be created to execute this event loop and dispatch events (call event handlers). Having more subsystems provide event notifications under the Event Ports framework will be helpful. We are looking at adding signal event notification to Event Ports.
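The dispatch step of such an event loop can be sketched in portable C. This does not use the Event Ports API: the event structure, the source identifiers (SRC_FILE and so on) and the function names are hypothetical stand-ins, so that the example compiles anywhere. In a real program the event would come from port_get(3C) and the switch would be on portev_source.

```c
#include <stddef.h>

/* Hypothetical event sources, standing in for the PORT_SOURCE_* values. */
enum ev_source { SRC_FILE, SRC_TIMER, SRC_USER, SRC_COUNT };

/* Hypothetical event, loosely modeled on what an event port returns. */
struct event {
        enum ev_source source;  /* which subsystem generated the event */
        int events;             /* source-specific event bits */
        void *user;             /* application cookie */
};

typedef void (*ev_handler_t)(const struct event *ev);

struct dispatcher {
        ev_handler_t handlers[SRC_COUNT];       /* one handler per source */
        unsigned long unhandled;                /* events with no handler */
};

/* Call the handler registered for the event's source, if there is one. */
int
dispatch_event(struct dispatcher *d, const struct event *ev)
{
        if ((unsigned)ev->source >= SRC_COUNT ||
            d->handlers[ev->source] == NULL) {
                d->unhandled++;
                return (-1);
        }
        d->handlers[ev->source](ev);
        return (0);
}

/* Example handler: accumulates the file event bits it has seen. */
int file_events_seen;

void
on_file_event(const struct event *ev)
{
        file_events_seen |= ev->events;
}
```

Each worker thread would run the same loop: collect one event, then call dispatch_event() on it.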

You can read more about Event Ports (called the event completion framework), with some examples, in this article. Also refer to Bart Smaalders's blog entry, which has a good example using event ports. Bart has been very influential in the Event Ports architecture. For API details, refer to the man page documentation for the Event Ports API functions: port_create(3C), port_associate(3C), port_dissociate(3C), port_get(3C), port_alert(3C), port_send(3C).

File Events Notification - API:

Event types

     Watchable events:
  •  FILE_ACCESS        /* Monitored file/directory was accessed */
  •  FILE_MODIFIED      /* Monitored file/directory was modified */
  •  FILE_ATTRIB        /* Monitored file/directory's ATTRIB was changed */
  •  FILE_NOFOLLOW      /* flag to indicate not to follow symbolic links */
     Exception events - cannot be filtered:
  •  FILE_DELETE        /* Monitored file/directory was deleted */
  •  FILE_RENAME_TO     /* Monitored file/directory was renamed */
  •  FILE_RENAME_FROM   /* Monitored file/directory was renamed */
  •  UNMOUNTED          /* Monitored file system got unmounted */
  •  MOUNTEDOVER        /* Monitored file/directory was mounted on */

Event types are defined in '<sys/port.h>'.

The File Events Notification source is identified by  'PORT_SOURCE_FILE', defined in '<sys/port.h>', when calling event port interface routines.

An event port is where the events from the different event sources are collected. The port is a file descriptor returned by the port_create(3C) function call.

To start watching a file for change events, it needs to be registered with the FEN source. The port_associate(3C) call is used to register a file.

The events are collected using port_get(3C) or port_getn(3C).

The port_dissociate(3C) call can be used to cancel a registered file watch.

Each event source defines an object type. The object type for the FEN source is:

typedef struct file_obj {
        timestruc_t     fo_atime;     /* Access time got from stat() */
        timestruc_t     fo_mtime;   /* Modification time from stat() */
        timestruc_t     fo_ctime;    /* Change time from stat() */
        uintptr_t        fo_pad[3];   /* For future expansion */
        char               *fo_name;   /* Pointer to a Null terminated path name */
} file_obj_t;

It is defined in '<sys/port.h>'.

The object 'file_obj_t' is initialized with the name of the file to be monitored and the time stamps of the file collected with a stat(2) call. The object pointer is passed to port_associate(3C) to register the file to be watched. The object pointer is also the handle that can be used to cancel the file events watch registration, if necessary, by passing it to port_dissociate(3C).

At the time a file is registered to be monitored, the time stamps passed in the file_obj_t are compared with the file's current time stamps. If they have changed, the relevant events are generated. This is how FEN ensures that change events are not missed between the time the file is processed and the time it is registered to be watched.
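The comparison done at registration time can be pictured with the following user-level sketch. It is not the kernel code. The flag values (EX_FILE_ACCESS and so on) are placeholders defined only for this example, and the mapping of access, modification and change time to the three watchable events is an assumption based on the description above.

```c
#define _POSIX_C_SOURCE 200809L /* for struct timespec */
#include <time.h>

/* Placeholder flag values for illustration; the real ones are in <sys/port.h>. */
#define EX_FILE_ACCESS          0x1
#define EX_FILE_MODIFIED        0x2
#define EX_FILE_ATTRIB          0x4

/* The three time stamps the application saved when it last processed the file. */
struct saved_times {
        struct timespec access_ts;
        struct timespec modify_ts;
        struct timespec change_ts;
};

static int
ts_differs(const struct timespec *a, const struct timespec *b)
{
        return (a->tv_sec != b->tv_sec || a->tv_nsec != b->tv_nsec);
}

/*
 * Compare the saved time stamps with the current ones and return the
 * events of interest that would be generated immediately.
 */
int
events_since(const struct saved_times *saved,
    const struct saved_times *current, int interest)
{
        int ev = 0;

        if (ts_differs(&saved->access_ts, &current->access_ts))
                ev |= EX_FILE_ACCESS;
        if (ts_differs(&saved->modify_ts, &current->modify_ts))
                ev |= EX_FILE_MODIFIED;
        if (ts_differs(&saved->change_ts, &current->change_ts))
                ev |= EX_FILE_ATTRIB;
        return (ev & interest);
}
```

A nonzero result means the file changed while it was being processed, so an event is due right away rather than at the next change.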

If a file object 'file_obj_t' that is already in use is reused to register a different file, the earlier file watch is canceled and the newly specified file is watched instead.

Example:

This program takes the path names of the files to be monitored and prints the change events received. It reads path names from standard input (stdin). Any number of file names can be entered.

In another session, modify the files that are being monitored and see the change events being printed.

The program exits upon reading EOF (^D).

Note - On an NFS file system, only file events due to operations local to the client are delivered. Change events occurring on the server will not be reported.
/*
 * fen.c
 * cc  fen.c -o fen
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <port.h>
#include <errno.h>
#include <pthread.h>
#include <thread.h>

/* Per-file state: the FEN object plus the events and port to use. */
struct fileinfo {
        struct file_obj fobj;
        int events;
        int port;
};

/* Print the names of all event bits set in 'event'. */
void
printevent(int event, char *pname)
{
        printf("%s :",pname);
        if (event & FILE_ACCESS) {
                printf(" FILE_ACCESS");
        }
        if (event & FILE_MODIFIED) {
                printf(" FILE_MODIFIED");
        }
        if (event & FILE_ATTRIB) {
                printf(" FILE_ATTRIB");
        }
        if (event & FILE_DELETE) {
                printf(" FILE_DELETE");
        }
        if (event & FILE_RENAME_TO) {
                printf(" FILE_RENAME_TO");
        }
        if (event & FILE_RENAME_FROM) {
                printf(" FILE_RENAME_FROM");
        }
        if (event & UNMOUNTED) {
                printf(" UNMOUNTED");
        }
        if (event & MOUNTEDOVER) {
                printf(" MOUNTEDOVER");
        }
        printf("\n");
}

/*
 * event handler for file events source.
 */
void
process_file(struct fileinfo *finf, int revents)
{
        struct file_obj *fobjp = &finf->fobj;
        int port = finf->port;
        struct stat sb;

        if (!(revents & FILE_EXCEPTION) &&
                stat(fobjp->fo_name, &sb) == -1) {
                fprintf(stderr, "Failed to stat file : %s - errno %d\n",
                        fobjp->fo_name, errno);
                free(fobjp->fo_name);
                free(finf);
                return;
        }

        /*
         * Add what ever processing that needs to be done
         * here. Process received events.
         */
        if (revents) {
                printevent(revents, fobjp->fo_name);
                /*
                 * If exception, no need to re-register.
                 */
                if (revents & FILE_EXCEPTION) {
                        free(fobjp->fo_name);
                        free(finf);
                        return;
                }
        }

        /*
         * re-register.
         */
        fobjp->fo_atime = sb.st_atim;
        fobjp->fo_mtime = sb.st_mtim;
        fobjp->fo_ctime = sb.st_ctim;
        if (port_associate(port, PORT_SOURCE_FILE, (uintptr_t)fobjp,
                 finf->events, (void *)finf) == -1) {
                /*
                 * Add error processing as required, file may have been
                 * deleted/moved.
                 */
                fprintf(stderr, "Failed to register file :%s - errno %d\n",
                        fobjp->fo_name, errno);
                free(fobjp->fo_name);
                free(finf);
                return;
        }
}

/*
 * worker threads wait here for event.
 */
void *
waitforevents(void *pn)
{
        int port = *((int *)pn);
        port_event_t pe;

        while (!port_get(port, &pe, NULL)) {
                /*
                 * Can add cases for other sources if this
                 * port is used to collect events from multiple sources.
                 */
                switch (pe.portev_source) {
                case PORT_SOURCE_FILE:
                        /* Call file events event handler */
                        process_file((struct fileinfo *)pe.portev_object,
                            pe.portev_events);
                        break;
                default:
                        fprintf(stderr, "Event from unexpected source : %d\n",
                            pe.portev_source);
                }
        }
        printf("worker thread exiting\n");
        return (NULL);
}

#define STRLEN 256
char str[STRLEN];

int
main ()
{
        struct fileinfo *finf;
        int port;
        char *stp;
        pthread_t tid;

        if ((port = port_create()) == -1) {
                perror("port_create");
                exit(1);
        }

        /*
         * create a worker thread to process events. Can add as many
         * worker threads as required.
         */
        pthread_create(&tid, NULL, waitforevents, (void *)&port);

        while(1) {
                if(fgets(str, STRLEN, stdin) == NULL) {
                        /* EOF */
                        break;
                }
                /* Terminate the path name at the first space or newline. */
                if ((stp = strstr(str, " ")) != NULL ||
                        (stp = strstr(str, "\n")) != NULL) {
                        *stp = '\0';
                }

                finf = malloc(sizeof(struct fileinfo));
                if (finf == NULL) {
                        perror("memory alloc");
                        exit(1);
                }

                if ((finf->fobj.fo_name = strdup(str)) == NULL) {
                        perror("strdup");
                        free(finf);
                        exit(1);
                }

                /*
                 * Event types to watch.
                 */
                finf->events = FILE_ACCESS|FILE_MODIFIED|FILE_ATTRIB;
                finf->port = port;

                /*
                 * Start monitor this file.
                 */
                process_file(finf, 0);
        }

        /*
         * close port, will de-activate all file events watches associated
         * with the port.
         */
        close(port);

        /*
         * wait for threads to exit.
         */
        while (thr_join(0, NULL, NULL) == 0)
                ;
        return (0);
}

Tuesday Jun 14, 2005

Dynamic segkp for 32bit x86 systems

OpenSolaris is here as promised. We are all really excited about that. Now, everyone interested can  look at
the source and understand how Solaris works.

To assist with that, I would like to describe a change I introduced with regard to a
difference in the virtual memory layout between 32bit and 64bit x86 systems. If you happen
to look at the virtual memory layout in startup.c for x86, you will notice that 'segkp' is
not a separate segment on 32bit systems, whereas it is a separate segment on 64bit systems.

 *              32-bit Kernel's Virtual memory layout.
 *              +-----------------------+
 * 0xFDFFE000  -|-----------------------|- ekernelheap, ptable_va
 *              |                       |  (segkp is an arena under the heap)
 *              |                       |
 *              |        kvseg          |
 *              |                       |
 *              |                       |
 * ---         -|-----------------------|- kernelheap (floating)
 *              |       Segkmap         |
 * 0xC3002000  -|-----------------------|- segkmap_start (floating)
 *              |       Red Zone        |
 * 0xC3000000  -|-----------------------|- kernelbase / userlimit (floating)
 * 0x08048000  -|-----------------------|
 *              |      user stack       |
 *              :                       :
 *              |       invalid         |
 * 0x00000000   +-----------------------+

On 32bit x86 systems, the virtual address space is shared between User and Kernel space
for performance reasons.  Therefore, the kernel address space is small.  As a result, the
32bit x86 systems run into memory exhaustion problems, mainly in the kernel heap space.
Read Kit Chow's blog entry on memory exhaustion issues.

In an attempt to provide a little more breathing space for the kernel heap, I was searching for
ways to free up some kernel address space and looked at the segkp segment. It was not being
used effectively on a 32bit system, where we are tight on kernel address space.

The segkp segment is used for allocating pageable kernel memory. The kernel stacks for
threads are allocated from the segkp segment. A necessary characteristic of this segment
driver is that it provides a redzone, which is used at the end of the stack to protect against stack
overflows. It also provides non-pageable memory.

The default segkp segment size was a whopping 200MB. Yes, that is quite large when
the kernel heap size can be less than 500MB.

In most cases, on a 32bit x86 system, the segkp segment space is not fully used, while
the system runs short of kernel heap space. On the other hand, having a small segkp size
will limit the number of threads that can be created on a system. Note that each thread
that is created requires a kernel stack, which is allocated from the segkp segment.

Therefore the solution was to make segkp dynamic and combine its space with the kernel heap.
The system can then use the limited kernel address space we have more effectively, by having
a larger heap space and using only the amount of virtual address space that is actually required
for segkp.

So, how was this done? Well, the vmem allocator in Solaris makes it simple to implement this.
The vmem allocator is used to manage virtual address space in the kernel. I will not go into
how the vmem allocator works, as that is beyond the scope of this blog entry. Since we wanted
to eliminate the segkp segment and add its free space to the kernel heap, all that was needed
was to make segkp a subset of the heap_arena instead of a separate segment. So now
segkp imports memory from its source, the heap_arena, dynamically as and when required.
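The idea of one arena importing from another can be shown with a deliberately simplified model. This is not the vmem allocator and does not manage real addresses. It only does byte accounting, and toy_arena and toy_alloc are names made up for this illustration.

```c
#include <stddef.h>

/* Toy arena: tracks only how much space it has, not where that space is. */
struct toy_arena {
        struct toy_arena *source;       /* arena to import from, or NULL */
        size_t free_bytes;              /* space currently available here */
        size_t imported_bytes;          /* total space imported from source */
};

/*
 * Allocate 'size' bytes from the arena. If the arena does not have
 * enough free space, it imports just the shortfall from its source.
 * Returns 0 on success, -1 if the request cannot be satisfied.
 */
int
toy_alloc(struct toy_arena *a, size_t size)
{
        if (a->free_bytes < size) {
                size_t need = size - a->free_bytes;

                if (a->source == NULL || toy_alloc(a->source, need) == -1)
                        return (-1);
                a->free_bytes += need;
                a->imported_bytes += need;
        }
        a->free_bytes -= size;
        return (0);
}
```

With a "segkp" arena whose source is a "heap" arena, segkp starts out empty and takes from the heap only what thread stacks actually need, so whatever it does not use stays available to the heap.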

Here is the part of the code change, in seg_kp.c, which makes segkp a subset of the heap_arena.
(bug 4983788)

        /*
         * Allocate the virtual memory for segkp and initialize it
         */
        if (segkp_fromheap) {
                np = btop(kvseg.s_size);
                segkp_bitmap = kmem_zalloc(BT_SIZEOFMAP(np), KM_SLEEP);
                kpsd->kpsd_arena = vmem_create("segkp", NULL, 0, PAGESIZE,
                    vmem_alloc, vmem_free, heap_arena, 5 * PAGESIZE, VM_SLEEP);
        } else {
                segkp_bitmap = NULL;
                np = btop(seg->s_size);
                kpsd->kpsd_arena = vmem_create("segkp", seg->s_base,

Now, with this change, segkp is dynamic on 32bit x86 systems, while retaining all of the characteristics
described above. The space originally occupied by the segkp segment is now combined
with the kernel heap space, making the heap larger by 200MB. There is no longer a separate segkp segment.
The vmem allocator provides the necessary caching.
