News, tips, partners, and perspectives for the Oracle Solaris operating system

  • February 12, 2018

posix_spawn() as an actual system call

Casper Dik
Senior Principal Software Engineer


As a developer, there are always those projects when it is hard to find a way to go forward.  Drop the project for now and find another project, if only to rest your eyes and find yourself a new insight for the temporarily abandoned project.  This is how I embarked on posix_spawn() as an actual system call you will find in Oracle Solaris 11.4. The original library implementation of posix_spawn() uses vfork(), but why care about the old address space if you are not going to use it? Or, worse, stop all the other threads in the process and don't start them until exec succeeded or when you call exit()?

As I had already written kernel modules for nefarious reason to run executables directly from the kernel, I decided to benchmark the simple "make process, execute /bin/true" against posix_spawn() from the library. Even with two threads, posix_spawn() scaled poorly: additional threads did not allow a large number of additional spawns per second.

Starting a new process

All ways to start a new process need to copy a number of process properties: file descriptors, credentials, priorities, resource controls, etc.

The original way to start a new process is fork(); you will need to mark all the pages as copy-on-write (O(n) in the size of the number of pages in the process) and so this gets more and more expensive when the process get larger and larger. In Solaris we also reserve all the needed swap; a large process calling fork() doubles its swap requirement.

In BSD vfork() was introduced; it borrows the address space and was cheap when it was invented.  In much larger processes with hundreds of threads, it became more and more of bottleneck.  Dynamic linking also throws a spanner in the works: what you can do between vfork() and the final exec() is extremely small.

In the standard universe, posix_spawn() was invented; it was aimed mostly at small embedded systems and a very number of specific actions can be performed before the new executable is run.  As it was part of the standard, Solaris grew its own copy build on top of vfork(). It has, of course, the same problems as vfork() has; but because it is implemented in the library we can be sure we steer clear from all the other vfork() pitfalls.

Native spawn(2) call

The native spawn(2) system introduced in Oracle Solaris 11.4 shares a lot of code with the forkx(2) and execve(2).  It mostly avoids doing those unneeded operations:

  • do not stop all threads
  • do not copy any data about the current executable
  • do not clear all watch points (vfork())
  • do not duplicate the address space (fork())
  • no need to handle shared memory segments
  • do not copy one or more of the threads (fork1/forkall), create a new one instead
  • do not copy all file pointers
  • no need to restart all threads held earlier

The exec() call copies from its own address space but when spawn(2) needs the argument, it is already in a new process.  So early in the spawn(2) system call we copy the environment vector and the arguments and save them away.  The data blob is given to the child and the parent waits until the client is about to return from the system call in the new process or when it decides that it can't actually exec and calls exit instead.

A process can spawn(2) in all its threads and the concurrently is only limited by locks that need to be held shortly when processes are created.

The performance win depends on the application; you won't win anything unless you use posix_spawn(); I was very happy to see that our standard shell is using posix_spawn() to start new processes as do popen(3c) as well as system(3c) so the call is well tested.  The more threads you have, the bigger the win. Stopping a thread is expensive, especially if it hold up in a system call. The world used to stop but now it just continues.

Support in truss(1), mdb(1)

When developing a new system call special attention needs to be given to proc(5) and truss(1) interaction.  The spawn(2) system call is an exception but only because it is much harder to get it right; support is also needed in debuggers or they won't see a new process starting. This includes mdb(1) but also truss(1).  They also need to learn that when spawn(2) succeeds, that they are stopping in a completely different executable; we may also have crossed a privilege boundary, e.g., when spawning su(8) or ping(8).

Join the discussion

Comments ( 1 )
  • Alan Hargreaves Monday, February 12, 2018
    The original bourne shell using fork()/exec() is the reason that I never integrated the Dtrace provider. With it turned on, the performance hit was substantial. I considered moving it to use posix_spawn(), but at the time we we were deprecating that shell and the risks outweighed the benefit.
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.