Solaris Networking - The Magic Revealed (Part I)

Many of you have asked for details on Solaris 10 networking. The great news is that I finished writing the treatise on the subject, which will become a new section in the Solaris Internals book by Jim Mauro and Richard McDougall. In the meanwhile, I have used some excerpts to create a mini book (Parts I and II) for the Networking community on OpenSolaris. Part II, containing the new high performance GLDv3 based device driver framework, the tuning guide for Solaris 10, etc., is below. Enjoy! As usual, comments (good or bad) are welcome.

Solaris Networking - The Magic Revealed (Part I)

  1. Background
  2. Solaris 10 stack
    1. Overview
    2. Vertical perimeter
    3. IP classifier
    4. Synchronization mechanism
  3. TCP
    1. Socket
    2. Bind
    3. Connect
    4. Listen
    5. Accept
    6. Close
    7. Data path
    8. TCP Loopback
  4. UDP
    1. UDP packet drop within the stack
    2. UDP Module
    3. UDP and Socket interaction
    4. Synchronous STREAMS
    5. STREAMs fallback
  5. IP
    1. Plumbing NICs
    2. IP Network MultiPathing (IPMP)
    3. Multicast
  6. Solaris 10 Device Driver framework
    1. GLDv2 and Monolithic DLPI drivers (Solaris 9 and before)
    2. GLDv3 - A New Architecture
    3. GLDv3 Link aggregation architecture
    4. Checksum offload
  7. Tuning for performance
  8. Future
  9. Acknowledgments

1 Background

The networking stack of Solaris 1.x was a BSD variant, pretty similar to the BSD Reno implementation. The BSD stack worked fine for low end machines, but Solaris needed to satisfy the needs of enterprise customers as well as low end customers, and as such migrated to the AT&T SVR4 architecture, which became Solaris 2.x.

With Solaris 2.x, the networking stack went through a makeover and transitioned from a BSD style stack to a STREAMS based stack. The STREAMS framework provided an easy message passing interface which allowed the flexibility of one STREAMS module interacting with another STREAMS module. Using the STREAMS inner and outer perimeters, a module writer could provide mutual exclusion without making the implementation complex. The cost of setting up a STREAM was high, but the number of connection setups per second was not an important criterion at the time, and connections were usually long lived. When connections were long lived (NFS, ftp, etc.), the cost of setting up a new stream was amortized over the life of the connection.

During the late 90s, servers became heavily SMP, running large numbers of CPUs. The cost of switching processing from one CPU to another became high as mid to high end machines became more NUMA centric. Since STREAMS by design did not have any CPU affinity, packets for a particular connection moved around to different CPUs. It became apparent that Solaris needed to move away from the STREAMS architecture.

The late 90s also saw the explosion of the web, and the increase in processing power meant a large number of short lived connections, making connection setup time equally important. With Solaris 10, the networking stack went through one more transition, where the core pieces (i.e. the socket layer, TCP, UDP, IP, and the device driver) use an IP classifier and serialization queues to improve connection setup time, scalability, and packet processing cost. STREAMS are still used to provide the flexibility that ISVs need to implement additional functionality.

2 Solaris 10 stack

Let's have a look at the new framework and its key components.


Overview

The pre Solaris 10 stack uses STREAMS perimeters and kernel adaptive mutexes for multi-threading. TCP uses a STREAMS QPAIR perimeter, UDP uses a STREAMS QPAIR perimeter with PUTSHARED, and IP a PERMOD perimeter with PUTSHARED, with various TCP, UDP, and IP global data structures protected by mutexes. The stack is executed by user-land threads executing various system calls, by the network device driver's read-side interrupt or worker thread, and by STREAMS framework worker threads. The existing perimeters provide only per module, per protocol stack layer (i.e. horizontal) protection. This can, and often does, lead to a packet being processed on more than one CPU and by more than one thread, leading to excessive context switching and poor CPU data locality. The problem is compounded further by the various places a packet can get queued under load and the various threads that finally process the packet.

The "FireEngine" approach is to merge all protocol layers into one STREAMs module which is fully multi threaded. Inside the merged module, instead of using per data structure locks, use a per CPU synchronization mechanism called "vertical perimeter". The "vertical perimeter" is implemented using a serialization queue abstraction called "squeue". Each squeue is bound to a CPU and each connection is in turn bound to a squeue which provides any synchronization and mutual exclusion needed for the connection specific data structures.

The connection (or context) lookup for inbound packets is done outside the perimeter, using the IP connection classifier, as soon as the packet reaches IP. Based on the classification, the connection structure is identified. Since the lookup happens outside the perimeter, we can bind a connection to an instance of the vertical perimeter (or "squeue") when the connection is initialized, and process all packets for that connection on the squeue it is bound to, maintaining better cache locality. More details about the vertical perimeter and classifier are given in later sections. The classifier also becomes the database for storing the sequence of function calls necessary for all inbound and outbound packets. This allows the Solaris networking stack to change from the current message passing interface to a BSD style function call interface. The string of functions created on the fly (event-list) for processing a packet for a connection is the basis for an eventual new framework in which other modules and 3rd party high performance modules can participate.

Vertical perimeter

Squeue guarantees that only a single thread can process a given connection at any given time thus serializing access to the TCP connection structure by multiple threads (both from read and write side) in the merged TCP/IP module. It is similar to the STREAMS QPAIR perimeter but instead of just protecting a module instance, it protects the whole connection state from IP to sockfs.

Vertical perimeters or squeues by themselves just provide packet serialization and mutual exclusion for the data structures, but by creating per CPU perimeters and binding a connection to the instance attached to the CPU processing its interrupts, we can guarantee much better data locality.

We could have chosen between creating a per connection perimeter or a per CPU perimeter, i.e. an instance per connection or per CPU. The overheads involved with a per connection perimeter and the resulting thread contention give lower performance, which made us choose a per CPU instance. For a per CPU instance, we had the choice of queuing the connection structure for processing, or instead queuing the packet itself and storing the connection structure pointer in the packet. The former approach leads to some interesting starvation scenarios where packets for a connection keep arriving, and the overheads of preventing such situations lowered performance. Queuing the packets themselves preserves ordering and is much simpler, and is thus the approach we took for FireEngine.

As mentioned before, each connection instance is assigned to a single squeue and is thus only processed within the vertical perimeter. As an squeue is processed by a single thread at a time, all data structures used to process a given connection from within the perimeter can be accessed without additional locking. This improves the CPU and thread context data locality of access for the connection metadata, the packet metadata, and the packet payload data. In addition, this allows the removal of per device driver worker thread schemes, which are problematic in solving a system wide resource issue, and allows additional strategic algorithms to be implemented to best handle a given network interface based on the throughput of the network interface and of the system (e.g. fanning out per connection packet processing to a group of CPUs). A thread entering the squeue may either process the packet right away or queue it for later processing by another thread or the worker thread. The choice depends on the squeue entry point and on the state of the squeue. Immediate processing is only possible when no other thread has entered the same squeue. The squeue is represented by the following abstraction:

typedef struct squeue_s {
    int_t     sq_flag;    /* Flag tells squeue status */
    kmutex_t  sq_lock;    /* Lock to protect the flag etc. */
    mblk_t    *sq_first;  /* First packet */
    mblk_t    *sq_last;   /* Last packet */
    thread_t  sq_worker;  /* The worker thread for squeue */
} squeue_t;

It's important to note that squeues are created on the basis of one per H/W execution pipeline, i.e. core, hyper thread, etc. The stack processing on a serialization queue (and its H/W execution pipeline) is limited to one thread at a time, but this actually improves performance: the new stack ensures that there are no waits for any resources, such as memory or locks, inside the vertical perimeter, and allowing more than one kernel thread to time share the H/W execution pipeline has more overhead than allowing only one thread to run uninterrupted.
  • Queuing Model - The queue is strictly FIFO (first in, first out) for both the read and write sides, which ensures that no particular connection suffers or is starved. A read side or write side thread enqueues its packet at the end of the chain. It may then either process the packet or signal the worker thread, based on the processing model below.
  • Processing Model - After enqueueing its packet, if another thread is already processing the squeue, the enqueuing thread returns and the packet is drained later based on the drain model. If the squeue is not being processed and there are no packets queued, the thread can mark the squeue as being processed (represented by 'sq_flag'), and processes the packet. Once it completes processing the packet, it removes the 'processing in progress' flag and makes the squeue free for future processing.
  • Drain Model - A thread, which was successfully able to process its own packet, can also drain any packets that were enqueued while it was processing the request. In addition, if the squeue is not being processed but there are packets already queued, then instead of queuing its packet and leaving, the thread can drain the queue and then process its own packets.

The worker thread is always allowed to drain the entire queue. Choosing the correct Drain model is quite complicated. Choices are
  • "always queue",
  • "process your own packet if you can",
  • "time bounded process and drain".
These options can be independently applied to the read thread and the write thread.

Typically, draining by an interrupt thread should always be time bounded "process and drain", while the write thread can choose between "process your own" and time bounded "process and drain". For Solaris 10, the write thread behavior is a tunable, with the default being "process your own", while the read side is fixed to time bounded "process and drain".
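The queuing, processing, and drain models above can be sketched in a few lines of C. This is a deliberately simplified, hypothetical illustration of the "process your own packet if you can, then drain" model: the struct fields and names echo the squeue abstraction shown earlier, but the locking is elided and the code is not the actual Solaris implementation.

```c
#include <stddef.h>

#define SQ_PROC 0x0001  /* squeue is currently being processed by a thread */

typedef struct mblk {
    struct mblk *b_next;
} mblk_t;

typedef struct squeue {
    int     sq_flag;
    mblk_t *sq_first;
    mblk_t *sq_last;
    int     sq_processed;  /* packets processed so far (illustration only) */
} squeue_t;

/* FIFO append: every thread queues at the tail, preserving ordering. */
static void sq_append(squeue_t *sqp, mblk_t *mp)
{
    mp->b_next = NULL;
    if (sqp->sq_last != NULL)
        sqp->sq_last->b_next = mp;
    else
        sqp->sq_first = mp;
    sqp->sq_last = mp;
}

/* "Process your own packet if you can" model; in the real stack the
 * checks below are protected by sq_lock. */
void squeue_enter(squeue_t *sqp, mblk_t *mp)
{
    sq_append(sqp, mp);
    if (sqp->sq_flag & SQ_PROC)
        return;               /* another thread owns the squeue: just queue */
    sqp->sq_flag |= SQ_PROC;  /* mark 'processing in progress' */
    while (sqp->sq_first != NULL) {       /* drain in FIFO order */
        mblk_t *cur = sqp->sq_first;
        sqp->sq_first = cur->b_next;
        if (sqp->sq_first == NULL)
            sqp->sq_last = NULL;
        sqp->sq_processed++;  /* stand-in for tcp_input()/tcp_output() */
    }
    sqp->sq_flag &= ~SQ_PROC; /* free the squeue for future processing */
}
```

A real implementation would also bound the drain loop in time and hand the remainder to the worker thread, per the drain models above.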

The signaling of the worker thread is another option worth exploring. If the packet arrival rate is low and a thread is forced to queue its packet, then the worker thread should be woken up as soon as the entering thread finishes processing the squeue, if there is still work to be done.

On the other hand, if the packet arrival rate is high, it may be desirable to delay waking up the worker thread hoping for an interrupt to arrive shortly after to complete the drain. Waking up the worker thread immediately when the packet arrival rate is high creates unnecessary contention between the worker and interrupt threads.

The default for Solaris 10 is delayed wakeup of the worker thread. Initial experiments on available servers showed that the best results are obtained by waking up the worker thread after a 10ms delay.

Placing a request on the squeue requires a per-squeue lock to protect the state of the queue, but this doesn't introduce scalability problems because it is distributed between CPUs and is only held for a short period of time. We also utilize optimizations which avoid context switches while still preserving the single-threaded semantics of squeue processing. We create an instance of an squeue per CPU in the system and bind the worker thread to that CPU. Each connection is then bound to a specific squeue, and thus to a specific CPU as well.

The binding of an squeue to a CPU can be changed but binding of a connection to an squeue never changes because of the squeue protection semantics. In the merged TCP/IP case, the vertical perimeter protects the TCP state for each connection. The squeue instance used by each connection is chosen either at the "open", "bind" or "connect" time for outbound connections or at "eager connection creation time" for inbound ones.

The choice of the squeue instance depends on the relative speeds of the CPUs and the NICs in the system. There are two cases:
  • CPU is faster than the NIC: the incoming connections are assigned to the "squeue instance" of the interrupted CPU. For the outbound case, connections are assigned to the squeue instance of the CPU the application is running on.
  • NIC is faster than the CPU: a single CPU is not capable of handling the NIC, so connections are bound in a random manner to all available squeues.
For Solaris 10, the determination of the NIC being faster or slower than the CPU is made by the system administrator in the form of the global tunable 'ip_squeue_fanout'. The default is 'no fanout', i.e. assign the incoming connection to the squeue attached to the interrupted CPU. For the purposes of taking a CPU offline, the worker thread bound to that CPU removes its binding and restores it when the CPU comes back online; this allows the DR functionality to work correctly. When packets for a connection arrive on multiple NICs (and thus interrupt multiple CPUs), they are always processed on the squeue the connection was originally established on. In Solaris 10, vertical perimeters are provided only for TCP based connections; the vertical perimeter is entered at the TCP and IP layers after determining that the packet belongs to a TCP connection. Solaris 10 updates will introduce a general vertical perimeter for any use.
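The two assignment cases can be illustrated with a small sketch. The tunable name ip_squeue_fanout follows the text; everything else (the fixed CPU count, the squeue array, the function name ip_squeue_assign) is hypothetical scaffolding for illustration, not the actual kernel code.

```c
#include <stdlib.h>

#define NCPUS 4

typedef struct squeue { int sq_cpu; } squeue_t;

static squeue_t squeues[NCPUS];   /* one squeue per CPU, worker bound to it */

int ip_squeue_fanout = 0;         /* administrator tunable; 0 = no fanout */

/* Pick the squeue for a new inbound connection. */
squeue_t *ip_squeue_assign(int interrupted_cpu)
{
    if (ip_squeue_fanout == 0)
        /* CPU faster than NIC: stay on the interrupted CPU's squeue. */
        return &squeues[interrupted_cpu];
    /* NIC faster than CPU: spread connections over all squeues. */
    return &squeues[rand() % NCPUS];
}
```

Once assigned, the connection keeps this squeue for its lifetime, even if later packets interrupt a different CPU.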
The squeue APIs look like:

squeue_t *squeue_create(squeue_t *, uint32_t, processorid_t, void (*)(), void *, clock_t, pri_t);
void squeue_bind(squeue_t *, processorid_t);
void squeue_unbind(squeue_t *);
void squeue_enter(squeue_t *, mblk_t *, void (*)(), void *);
void squeue_fill(squeue_t *, mblk_t *, void (*)(), void *);

squeue_create() instantiates a new squeue, and squeue_bind()/squeue_unbind() bind or unbind it from a particular CPU. Squeues, once created, are never destroyed. squeue_enter() is used to try to access the squeue; the entering thread is allowed to process and drain the squeue based on the models discussed before. squeue_fill() just queues a packet on the squeue, to be processed by the worker thread or other threads.

IP classifier

The IP connection fanout mechanism consists of 3 hash tables: a 5-tuple hash table {protocol, remote and local IP addresses, remote and local ports} to keep fully qualified TCP (ESTABLISHED) connections; a 3-tuple table consisting of protocol, local address, and local port to keep the listeners; and a single-tuple table for protocol listeners. As part of the lookup, a connection structure (a superset of all connection information) is returned. This connection structure is called 'conn_t' and is abstracted below.

typedef struct conn_s {
    kmutex_t conn_lock;           /* Lock for conn_ref */
    uint32_t conn_ref;            /* Reference counter */
    uint32_t conn_flags;          /* Flags */

    struct ill_s *conn_ill;       /* The ill packets are coming on */
    struct ire_s *conn_ire;       /* ire cache for outbound packets */
    tcp_t *conn_tcp;              /* Pointer to tcp struct */
    void *conn_ulp;               /* Pointer for upper layer */
    edesc_pf conn_send;           /* Function to call on write side */
    edesc_pf conn_recv;           /* Function to call on read side */
    squeue_t *conn_sqp;           /* Squeue for processing */

    /* Addresses and ports */
    struct {
        in6_addr_t connua_laddr;  /* Local address */
        in6_addr_t connua_faddr;  /* Remote address */
    } connua_v6addr;
#define conn_src   V4_PART_OF_V6(connua_v6addr.connua_laddr)
#define conn_rem   V4_PART_OF_V6(connua_v6addr.connua_faddr)
#define conn_srcv6 connua_v6addr.connua_laddr
#define conn_remv6 connua_v6addr.connua_faddr
    union {
        /* Used for classifier match performance */
        uint32_t conn_ports2;
        struct {
            in_port_t tcpu_fport; /* Remote port */
            in_port_t tcpu_lport; /* Local port */
        } tcpu_ports;
    } u_port;
#define conn_fport u_port.tcpu_ports.tcpu_fport
#define conn_lport u_port.tcpu_ports.tcpu_lport
#define conn_ports u_port.conn_ports2
    uint8_t conn_protocol;        /* Protocol type */
    kcondvar_t conn_cv;
} conn_t;

The interesting member to note is the pointer to the squeue, or vertical perimeter. The lookup is done outside the perimeter, and the packet is processed/queued on the squeue the connection is attached to. Also, conn_recv and conn_send point to the read side and write side functions, respectively. The read side function can be 'tcp_input' if the packet is meant for TCP.

Also, the connection fanout mechanism has provisions for supporting wildcard listeners, i.e. INADDR_ANY. Currently, the connected and bind tables are primarily for TCP and UDP only. A listener entry is made during the listen() call. For TCP, an entry is made into the connected table after the three-way handshake is complete.

The IP classifier APIs look like:

conn_t *ipcl_conn_create(uint32_t type, int sleep);
void ipcl_conn_destroy(conn_t *connp);

int ipcl_proto_insert(conn_t *connp, uint8_t protocol);
int ipcl_proto_insert_v6(conn_t *connp, uint8_t protocol);
conn_t *ipcl_proto_classify(uint8_t protocol);
int *ipcl_bind_insert(conn_t *connp, uint8_t protocol, ipaddr_t src,
    uint16_t lport);
int *ipcl_bind_insert_v6(conn_t *connp, uint8_t protocol,
    const in6_addr_t *src, uint16_t lport);
int *ipcl_conn_insert(conn_t *connp, uint8_t protocol, ipaddr_t src,
    ipaddr_t dst, uint32_t ports);
int *ipcl_conn_insert_v6(conn_t *connp, uint8_t protocol,
    in6_addr_t *src, in6_addr_t *dst, uint32_t ports);
void ipcl_hash_remove(conn_t *connp);
conn_t *ipcl_classify_v4(mblk_t *mp);
conn_t *ipcl_classify_v6(mblk_t *mp);
conn_t *ipcl_classify(mblk_t *mp);

The names of the functions are pretty self-explanatory.
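To make the classifier's role concrete, here is a heavily simplified, IPv4-only sketch of a 5-tuple hash insert and lookup. The hash function, table size, trimmed-down conn struct, and the names ipcl_insert/ipcl_lookup are invented for illustration; they are not the actual ipclassifier code, which handles IPv6, the 3-tuple and single-tuple tables, and reference counting as well.

```c
#include <stdint.h>
#include <stddef.h>

#define CONN_HASH_SIZE 256

/* A pared-down stand-in for conn_t, keeping only the match fields. */
typedef struct conn {
    uint8_t  conn_protocol;
    uint32_t conn_src, conn_rem;      /* local / remote IPv4 address */
    uint16_t conn_lport, conn_fport;  /* local / remote port */
    struct conn *conn_next;           /* hash chain */
} conn_t;

static conn_t *conn_hash[CONN_HASH_SIZE];

/* Illustrative hash over the address/port tuple. */
static unsigned conn_hashval(uint32_t src, uint32_t dst,
                             uint16_t lport, uint16_t fport)
{
    return (src ^ dst ^ (((uint32_t)lport << 16) | fport)) % CONN_HASH_SIZE;
}

void ipcl_insert(conn_t *connp)
{
    unsigned h = conn_hashval(connp->conn_src, connp->conn_rem,
                              connp->conn_lport, connp->conn_fport);
    connp->conn_next = conn_hash[h];
    conn_hash[h] = connp;
}

/* Fully qualified (ESTABLISHED) lookup: all five tuple fields match. */
conn_t *ipcl_lookup(uint8_t proto, uint32_t src, uint32_t dst,
                    uint16_t lport, uint16_t fport)
{
    conn_t *connp = conn_hash[conn_hashval(src, dst, lport, fport)];
    for (; connp != NULL; connp = connp->conn_next) {
        if (connp->conn_protocol == proto &&
            connp->conn_src == src && connp->conn_rem == dst &&
            connp->conn_lport == lport && connp->conn_fport == fport)
            return connp;
    }
    return NULL;  /* caller would fall back to the 3-tuple listener table */
}
```

On a miss, the real classifier falls back to the 3-tuple listener table and then the single-tuple protocol table, as described above.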

Synchronization mechanism

Since the stack is fully multi-threaded (barring the per CPU serialization enforced by the vertical perimeter), it uses a reference based scheme to ensure that a connection instance is available when needed. The reference count is implemented by the 'conn_t' member 'conn_ref' and protected by 'conn_lock'. The prime purpose of the lock is not to protect the bulk of 'conn_t' but just the reference count. Each time some entity takes a reference to the data structure (stores a pointer to the data structure for later processing), it increments the reference count by calling the CONN_INC_REF macro, which acquires the 'conn_lock', increments 'conn_ref', and drops the 'conn_lock'. Each time the entity drops its reference to the connection instance, it calls the CONN_DEC_REF macro.

For an established TCP connection, there are guaranteed to be 3 references on it: each protocol layer has a reference on the instance (one each for TCP and IP), and the classifier itself has a reference since it's an established connection. Each time a packet arrives for the connection and the classifier looks up the connection instance, an extra reference is placed, which is dropped when the protocol layer finishes processing that packet. Similarly, any timers running on the connection instance hold a reference to ensure that the instance is around whenever the timer fires. The memory associated with the connection instance is freed once the last reference is dropped.
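A minimal sketch of this reference-counting scheme, using pthreads in place of kernel mutexes: the macro names CONN_INC_REF/CONN_DEC_REF match the text, but the bodies are illustrative (the real CONN_DEC_REF frees the conn_t with the kernel allocator; here a flag records the free).

```c
#include <pthread.h>

typedef struct conn {
    pthread_mutex_t conn_lock;  /* protects conn_ref only */
    unsigned conn_ref;          /* reference counter */
    int freed;                  /* illustration: set when last ref dropped */
} conn_t;

#define CONN_INC_REF(connp) do {                   \
    pthread_mutex_lock(&(connp)->conn_lock);       \
    (connp)->conn_ref++;                           \
    pthread_mutex_unlock(&(connp)->conn_lock);     \
} while (0)

#define CONN_DEC_REF(connp) do {                   \
    pthread_mutex_lock(&(connp)->conn_lock);       \
    unsigned ref_ = --(connp)->conn_ref;           \
    pthread_mutex_unlock(&(connp)->conn_lock);     \
    if (ref_ == 0)                                 \
        (connp)->freed = 1;  /* kmem/free in the real stack */ \
} while (0)
```

An established connection starts at 3 references (TCP, IP, classifier); a packet in flight or a pending timer temporarily raises the count, and the structure is freed only when the count reaches zero.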


3 TCP

Solaris 10 provides the same view of TCP as previous releases, i.e. TCP appears as a clone device, but it is actually a composite, with the TCP and IP code merged into a single D_MP STREAMS module. The merged TCP/IP module's STREAMS entry points for open and close are the same as IP's entry points, viz. ip_open and ip_close. Based on the major number passed during open, IP decides whether the open corresponds to a TCP open or an IP open. The put and service STREAMS entry points for TCP are tcp_wput, tcp_wsrv, and tcp_rsrv. The tcp_wput entry point simply serves as a wrapper routine and enables sockfs and other modules from the top to talk to TCP using STREAMS. Note that tcp_rput is missing, since IP calls TCP functions directly. IP's STREAMS entry points remain unchanged.

The operational part of TCP is fully protected by the vertical perimeter, which is entered through the squeue_* primitives, as illustrated in Fig. 4. Packets flowing from the top enter TCP through the wrapper function tcp_wput, which then tries to execute the real TCP output processing function tcp_output after entering the corresponding vertical perimeter. Similarly, packets coming from the bottom try to execute the real TCP input processing function tcp_input after entering the vertical perimeter. There are multiple entry points into TCP through the vertical perimeter:

Fig. 4

  • tcp_input - all inbound data packets and control messages
  • tcp_output - all outbound data packets and control messages
  • tcp_close_output - on user close
  • tcp_timewait_output - time-wait expiry
  • tcp_rsrv_input - flow control relief on the read side
  • tcp_timer - all TCP timers

The Interface between TCP and IP

FireEngine changes the interface between TCP and IP from the existing STREAMS based message passing interface to a function call based interface, in both the control and data paths. On the outbound side, TCP passes a fully prepared packet directly to IP by calling ip_output while inside the vertical perimeter.
Similarly, control messages are also passed directly as function arguments. ip_bind_v{4, 6} receives a bind message as an argument, performs the required action, and returns a result mp to the caller. TCP directly calls ip_bind_v{4, 6} in the connect(), bind(), and listen() paths. IP still retains all its STREAMS entry points, but TCP (/dev/tcp) becomes a real device driver, i.e. it can't be pushed over other device drivers.

The basic protocol processing code was unchanged. Let's have a look at common socket calls and see how they interact with the framework.


Socket

A socket open of TCP, or an open of /dev/tcp, eventually calls into ip_open. The open then calls into the IP connection classifier and allocates the per-TCP-endpoint control block, already integrated with the conn_t. It also chooses the squeue for this connection. In the case of an internal open, i.e. by sockfs for an acceptor stream, almost nothing is done, and we delay doing the useful work till accept time.


Bind

tcp_bind eventually needs to talk to IP to figure out whether the address passed in is valid. FireEngine TCP prepares this request as usual in the form of a TPI message; however, this message is passed directly as a function argument to ip_bind_v{4, 6}, which returns the result as another message. The use of messages as parameters is helpful in leveraging the existing code with minimal change. The port hash table used by TCP to validate binds still remains in TCP, since the classifier has no use for it.


Connect

The changes in tcp_connect are similar to tcp_bind. The full bind() request is prepared as a TPI message and passed as a function argument to ip_bind_v{4, 6}. IP calls into the classifier and inserts the connection in the connected hash table. The conn_ hash table in TCP is no longer used.


Listen

This path is part of tcp_bind. tcp_bind prepares a local bind TPI message and passes it as a function argument to ip_bind_v{4, 6}. IP calls the classifier and inserts the connection in the bind hash table. The listen hash table of TCP does not exist any more.


Accept

The pre Solaris 10 accept implementation did the bulk of the connection setup processing in the listener's context. The three way handshake was completed in the listener's perimeter, and the connection indication was sent up the listener's STREAM. The messages necessary to perform the accept were sent down the listener's STREAM, and the listener was single threaded from the point of sending the T_CONN_RES message to TCP till sockfs received the acknowledgment. If the connection arrival rate was high, the ability of the pre Solaris 10 stack to accept new connections deteriorated significantly.

Furthermore, there was some additional TCP overhead involved which contributed to the slower accept rate. When sockfs opened an acceptor STREAM to TCP to accept a new connection, TCP was not aware that the data structures necessary for the new connection had already been allocated, so it allocated new structures and initialized them, only to free them later as part of the accept processing. Another major problem with the pre Solaris 10 design was that packets for a newly created connection arrived on the listener's perimeter. This required a check for every incoming packet, and packets landing on the wrong perimeter had to be sent to their correct perimeter, causing additional delay.

The FireEngine model establishes an eager connection (an incoming connection is called an eager till the accept completes) in its own perimeter as soon as a SYN packet arrives, thus making sure that packets always land on the correct connection. As a result, it is possible to completely eliminate the TCP global queues. The connection indication is still sent to the listener on the listener's STREAM, but the accept happens on the newly created acceptor STREAM (thus there is no need to allocate data structures for this STREAM), and the acknowledgment can be sent on the acceptor STREAM. As a result, sockfs doesn't need to become single threaded at any time during the accept processing.

The new model was carefully implemented because the new incoming connection (eager) exists only because there is a listener for it and both eager and listener can disappear at any time during accept processing as a result of eager receiving a reset or listener closing.

The eager starts out by placing a reference on the listener so that the eager's reference to the listener is always valid even though the listener might close. When a connection indication needs to be sent after the three way handshake is completed, the eager places a reference on itself so that it can close on receiving a reset while any reference to it is still valid. The eager sends a pointer to itself as part of the connection indication message, which is sent via the listener's STREAM after checking that the listener has not closed. When the T_CONN_RES message comes down the newly created acceptor STREAM, we again enter the eager's perimeter and check that the eager has not closed because of receiving a reset before completing the accept processing. For TLI/XTI based applications, the T_CONN_RES message is still handled on the listener's STREAM, and the acknowledgment is sent back on the listener's STREAM, so there is no change in behavior.


Close

Close processing in TCP now does not have to wait till the reference count drops to zero, since references to the closing queue and references to the TCP are now decoupled. Close can return as soon as all references to the closing queue are gone. The TCP data structures themselves may continue to stay around as a detached TCP in most cases. The release of the last reference to the TCP frees up the TCP data structure.
A user initiated close only closes the stream; the underlying TCP structures may continue to stay around. The TCP then goes through the FIN/ACK exchange with the peer after all user data is transferred and enters the TIME_WAIT state, where it stays around for a certain duration. This is called a detached TCP. These detached TCPs also need protection to prevent outbound and inbound processing from happening at the same time on a given detached TCP.

Data path

In the most common case, TCP does not even need to call IP to transmit the outbound packet if it can access the IRE. With a merged TCP/IP, we have the advantage of being able to access the cached IRE for a connection, and TCP can putnext the data directly to the link layer driver based on the information in the IRE. FireEngine does exactly that.

TCP Loopback

TCP Fusion is a protocol-less data path for loopback TCP connections in Solaris 10. The fusion of two local TCP endpoints occurs at connection establishment time. By default, all loopback TCP connections are fused. This behavior may be changed by setting the system wide tunable do_tcp_fusion to 0. Various conditions on both endpoints need to be met for fusion to be successful:
  • They must share a common squeue.
  • They must be TCP and not "raw socket".
  • They must not require protocol-level processing, i.e. IPsec or IPQoS policy is not present for the connection.
If fusion fails, we fall back to the regular TCP data path; if it succeeds, both endpoints proceed to use tcp_fuse_output() as the transmit path. tcp_fuse_output() enqueues application data directly onto the peer's receive queue; no protocol processing is involved. After enqueueing the data, the sender can either push the data up the receiver's read queue, by calling putnext(9F), or simply return and let the receiver retrieve the enqueued data via the synchronous STREAMS entry point. The latter path is taken if synchronous STREAMS is enabled. Synchronous STREAMS gets automatically disabled if sockfs no longer resides directly on top of the TCP module due to a module insertion or removal.
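The eligibility conditions above reduce to a simple predicate over the two endpoints. The sketch below is hypothetical: the struct, its fields, and the name tcp_fuse_ok are invented to illustrate the checks, not taken from the actual tcp_fuse code.

```c
#include <stdbool.h>

/* Pared-down view of a loopback TCP endpoint, for illustration. */
typedef struct tcp_ep {
    void *sqp;           /* squeue the endpoint is bound to */
    bool  is_tcp;        /* a real TCP endpoint, not a raw socket */
    bool  policy;        /* IPsec or IPQoS policy present */
} tcp_ep_t;

/* Both endpoints must share a squeue, be TCP (not raw sockets), and
 * require no protocol-level (IPsec/IPQoS) processing. */
bool tcp_fuse_ok(const tcp_ep_t *a, const tcp_ep_t *b)
{
    return a->sqp == b->sqp &&
           a->is_tcp && b->is_tcp &&
           !a->policy && !b->policy;
}
```

If the predicate is false at connection establishment, the connection simply stays on the regular TCP data path.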

Locking in TCP Fusion is handled by the squeue and the mutex tcp_fuse_lock. One of the requirements for fusion to succeed is that both endpoints need to be using the same squeue, which ensures that neither side can disappear while the other side is still sending data. By itself, the squeue is not sufficient for guaranteeing safe access when synchronous STREAMS is enabled. The reason is that tcp_fuse_rrw() doesn't enter the squeue, and its access to tcp_rcv_list and other fusion-related fields needs to be synchronized with the sender. tcp_fuse_lock is used for this purpose.

Rate Limit for Small Writes

Flow control for TCP Fusion in synchronous STREAMS mode is achieved by checking the size of the receive buffer and the number of data blocks, each against a different limit. This is different from regular STREAMS flow control, where the cumulative size check dominates the data block count check (the STREAMS queue high water mark typically represents bytes). Each enqueue triggers a notification sent to the receiving process; a build up of data blocks indicates a slow receiver, and the sender should be blocked or informed at the earliest moment instead of further wasting system resources. In effect, this is equivalent to limiting the number of outstanding segments in flight.

The minimum number of allowable enqueued data blocks defaults to 8 and is changeable via the system wide tunable tcp_fusion_burst_min, either to a higher value or to 0 (the latter disables the burst check).
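The two-limit check can be sketched as follows. The tunable name tcp_fusion_burst_min and the default of 8 follow the text; the function name and parameters are invented for illustration and are not the actual kernel code.

```c
#include <stdint.h>

uint32_t tcp_fusion_burst_min = 8;  /* default; 0 disables the burst check */

/* Returns nonzero when the fused sender should be flow-controlled:
 * either the cumulative bytes exceed the receive buffer high water
 * mark, or the enqueued data-block count reaches the burst minimum
 * (a build up of blocks indicates a slow receiver). */
int tcp_fuse_flow_controlled(uint32_t rcv_bytes, uint32_t rcv_hiwat,
                             uint32_t rcv_blocks)
{
    if (rcv_bytes >= rcv_hiwat)
        return 1;   /* receive buffer full: byte limit hit */
    if (tcp_fusion_burst_min != 0 && rcv_blocks >= tcp_fusion_burst_min)
        return 1;   /* too many small segments outstanding */
    return 0;
}
```

Unlike regular STREAMS flow control, the block-count limit here can trip well before the byte limit, which is exactly the "rate limit for small writes" behavior described above.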


4 UDP

Apart from the framework improvements, Solaris 10 made additional changes in the way UDP packets move through the stack. The internal code name for the project was "Yosemite". Pre Solaris 10, the UDP processing cost was evenly divided between per packet processing cost and per byte processing cost. The per packet processing cost was generally due to STREAMS, stream head processing, and packet drops in the stack and driver. The per byte processing cost was due to the lack of H/W checksum and unoptimized code branches throughout the network stack.

UDP packet drop within the stack

Although UDP is supposed to be unreliable, local area networks have become quite reliable, and applications tend to assume there will be no packet loss in a LAN environment. This assumption was largely true, but the pre Solaris 10 stack was not very effective at dealing with UDP overload and tended to drop packets within the stack itself.

On inbound, packets were dropped at more than one layer along the receive path. For UDP, the most common and obvious place is the IP layer, due to a lack of resources needed to queue the packets. Another important yet less apparent place for packet drops is the network adapter layer. This type of drop commonly occurs when the machine is dealing with a high rate of incoming packets.

UDP sockfs

The UDP sockfs extension (sockudp) is an alternative path to socktpi used for handling sockets-based UDP applications. It provides a more direct channel between the application and the network stack by eliminating the stream head and the TPI message-passing interface, allowing direct data and function access throughout the socket and transport layers. This makes the stack more efficient and, coupled with UDP hardware checksum offload (even for fragmented UDP), ensures that UDP packets are rarely dropped within the stack.

UDP Module

Solaris 10 introduced a fully multithreaded UDP module running in the same protection domain as IP. It allows for a tighter integration of the transport (UDP) with the layers above and below it: socktpi can make direct calls to UDP, and UDP may likewise make direct calls to the data link layer. In the post GLDv3 world, the data link layer may also make direct calls to the transport. In addition, utility functions can be called directly instead of through a message-based interface.

UDP needs exclusive operation on a per-endpoint basis when executing functions that modify the endpoint state. udp_rput_other() deals with packets containing IP options, and processing these packets ends up updating the endpoint's option-related state. udp_wput_other() deals with control operations from the top, e.g. connect(3SOCKET), which needs to update the endpoint state. In the STREAMS world this synchronization was achieved by using shared inner perimeter entry points and by using qwriter_inner() to get exclusive access to the endpoint.

The Solaris 10 model uses an internal, STREAMS-independent perimeter to achieve the above synchronization and is described below:
  • udp_enter() - Enter the UDP endpoint perimeter.
  • udp_become_writer() - Become exclusive on the UDP endpoint. Specifies a function that will be called exclusively, either immediately or later once the perimeter becomes exclusively available.
  • udp_exit() - Exit the UDP endpoint perimeter.
Entering UDP from the top or from the bottom must be done using udp_enter(). As in the general case, no locks may be held across these perimeter entry points. When finished with the exclusive mode, udp_exit() must be called to leave the perimeter.
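A single-threaded toy model of this perimeter might look like the sketch below. The entry-point names follow the text, but the fields, the deferral mechanism, and the callback signature are invented simplifications, not the actual udp_t implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the UDP endpoint perimeter: udp_become_writer() runs
 * its function immediately if the caller is the only thread inside
 * the perimeter, otherwise it defers the function until the last
 * thread calls udp_exit(). */

typedef struct udp_ep udp_ep_t;
typedef void (*udp_func_t)(udp_ep_t *, int);

struct udp_ep {
    int reader_count;      /* threads currently inside the perimeter */
    udp_func_t pending;    /* deferred exclusive function, if any    */
    int pending_arg;
    int state;             /* endpoint state touched exclusively     */
};

static void
udp_enter(udp_ep_t *ep)
{
    ep->reader_count++;
}

static void
udp_become_writer(udp_ep_t *ep, udp_func_t fn, int arg)
{
    if (ep->reader_count == 1) {
        fn(ep, arg);                /* sole thread: run exclusively now */
    } else {
        ep->pending = fn;           /* defer until perimeter drains */
        ep->pending_arg = arg;
    }
}

static void
udp_exit(udp_ep_t *ep)
{
    if (--ep->reader_count == 0 && ep->pending != NULL) {
        udp_func_t fn = ep->pending;

        ep->pending = NULL;
        ep->reader_count = 1;       /* deferred writer owns perimeter */
        fn(ep, ep->pending_arg);
        ep->reader_count = 0;
    }
}

/* Example exclusive operation: update endpoint state. */
static void
set_state(udp_ep_t *ep, int arg)
{
    ep->state = arg;
}
```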

To support this, the new UDP model employs two modes of operation, namely UDP_MT_HOT mode and UDP_SQUEUE mode. In UDP_MT_HOT mode, multiple threads may enter a UDP endpoint concurrently. This mode is used for sending or receiving normal data and is similar to the putshared STREAMS entry points. Control operations and other special cases call udp_become_writer() to become exclusive on a per-endpoint basis, which results in a transition to UDP_SQUEUE mode. An squeue, by definition, serializes access to the conn_t. When there are no more pending messages on the squeue for the UDP connection, the endpoint reverts to UDP_MT_HOT mode. In between, while not all MT threads of an endpoint have finished, messages are queued in the endpoint and UDP is in one of two transient modes, UDP_MT_QUEUED or UDP_QUEUED_SQUEUE.

While in the stable modes, UDP keeps track of the number of threads operating on the endpoint. The udp_reader_count variable represents the number of threads that have entered the endpoint as readers while it is in UDP_MT_HOT mode. The transition to UDP_SQUEUE happens when there is only a single reader, i.e. when this counter drops to 1. Likewise, udp_squeue_count represents the number of threads operating on the endpoint's squeue while it is in UDP_SQUEUE mode. The transition back to UDP_MT_HOT happens after the last thread exits the endpoint.

Though UDP and IP run in the same protection domain, they are still separate STREAMS modules. Therefore, the STREAMS plumbing is kept unchanged and a UDP module instance is always pushed above IP. Although this causes an extra open and close for every UDP endpoint, it provides backwards compatibility for applications that rely on such plumbing geometry, e.g. issuing I_POP on the stream to obtain direct access to IP.

The actual UDP processing is done within the IP instance. The UDP module instance does not possess any state about the endpoint and merely acts as a dummy module, whose presence keeps the STREAMS plumbing appearance unchanged.

Solaris 10 allows for the following plumbing modes:
  • Normal - IP is first opened and later UDP is pushed directly on top. This is the default action that happens when a UDP socket or device is opened.
  • SNMP - UDP is pushed on top of a module other than IP. When this happens it will support only SNMP semantics.
These modes imply that no intermediate module between IP and UDP is supported; in fact, Solaris has never supported such a scenario, as the inter-layer communication semantics between IP and transport modules are private.

UDP and Socket interaction

A significant event that takes place during the socket(3SOCKET) system call is the plumbing of the modules associated with the socket's address family and protocol type. A TCP or UDP socket will most likely result in sockfs residing directly atop the corresponding transport module. Pre Solaris 10, the socket layer used STREAMS primitives to communicate with the UDP module. Solaris 10 introduced a functionally callable interface, eliminating the need to create a T_UNITDATA_REQ message for metadata on each transmit from sockfs to UDP. Instead, the data and its ancillary information (i.e. the remote socket address) can be provided directly to an alternative UDP entry point, avoiding the extra allocation cost.
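The saving can be illustrated with the following contrast between the two transmit paths. T_UNITDATA_REQ is the real TPI primitive; everything else here (function names, the destination structure, the allocation counter) is a simplified stand-in, not the actual sockfs/UDP interface.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative contrast: the old TPI path allocates a separate
 * T_UNITDATA_REQ-style metadata message for every send, while the
 * direct entry point takes the destination address as an ordinary
 * argument and allocates nothing extra. */

static int metadata_allocs;    /* counts per-send metadata allocations */

struct dest { unsigned short family; char addr[14]; };

/* Old path: wrap the destination address in a metadata block that
 * the transport must later parse and free. */
static void
udp_wput_tpi(const void *data, size_t len, const struct dest *dst)
{
    struct dest *tudr = malloc(sizeof (*tudr));  /* extra block per send */

    metadata_allocs++;
    *tudr = *dst;
    (void) data; (void) len;
    /* ... transport parses tudr, then transmits ... */
    free(tudr);
}

/* New path: the address is handed straight to the transport entry
 * point, so no per-send metadata message is needed. */
static void
udp_send_direct(const void *data, size_t len, const struct dest *dst)
{
    (void) data; (void) len; (void) dst;
    /* ... transmit directly ... */
}
```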

For transport modules, being directly beneath sockfs allows for synchronous STREAMS to be used. This enables the transport layer to buffer incoming data to be later retrieved by the application (via synchronous STREAMS) when a read operation is issued, therefore shortening the receive processing time.

Synchronous STREAMS

Synchronous STREAMS is an extension to the traditional STREAMS interface for message passing and processing. It was originally added as part of the combined copy and checksum effort. It offers a way for the entry point of a module or driver to be called synchronously with respect to a user I/O request. In traditional STREAMS, the stream head is the synchronization barrier for such requests; synchronous STREAMS provides a mechanism to move this barrier from the stream head down to a module below it.

The TCP implementation of synchronous STREAMS in pre Solaris 10 was complicated by several factors, a major one being the combined checksum and copyin/copyout operations. In Solaris 10, TCP no longer depends on checksumming during copyin/copyout, so the mechanism was greatly simplified for use with loopback TCP and UDP on the read side. The synchronous STREAMS entry points are called during requests such as read(2) or recv(3SOCKET). Instead of sending the data upstream using putnext(9F), these modules enqueue the data in their internal receive queues and let the send thread return sooner. This avoids calling strrput() to enqueue the data at the stream head from within the send thread context, allowing for better dynamics: reducing the time taken to enqueue and signal/poll-notify the receiving application lets the send thread return faster to do further work, i.e. things are less serialized than before.

Each time data arrives, the transport module schedules the application to retrieve it. If the application is currently blocked (sleeping) in a read operation, it is unblocked and allowed to resume execution. This is achieved by calling STR_WAKEUP_SET() on the stream. Likewise, when there is no more data available for the application, the transport module allows it to block again on the next read attempt by calling STR_WAKEUP_CLEAR(). Any new data that arrives before then overrides this state and causes subsequent read operations to proceed.

An application may also be blocked in poll(2) until a read event takes place, or it may be waiting for a SIGPOLL or SIGIO signal if the socket used is non-blocking. Because of this, the transport module delivers the event notification and/or signals the application each time it receives data. This is achieved by calling STR_SENDSIG() on the corresponding stream.
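The receive-side notification flow described in the last two paragraphs can be modeled with the sketch below. STR_WAKEUP_SET(), STR_WAKEUP_CLEAR(), and STR_SENDSIG() are the real synchronous STREAMS primitives named above; the queue structure and functions around them are a toy, and (matching the UDP behavior) the read entry point hands back one datagram per call.

```c
#include <assert.h>

/* Toy model of the transport module's receive-side notification:
 * data arrival enqueues locally and wakes/notifies the reader; the
 * synchronous read entry point drains one datagram at a time and
 * clears the wakeup state when the queue empties. */

typedef struct rcvq {
    int nmsgs;         /* datagrams on the internal receive queue */
    int wakeup_set;    /* would a blocked reader be runnable?     */
    int sigs_sent;     /* poll/SIGPOLL notifications delivered    */
} rcvq_t;

/* Data arrival: enqueue locally instead of putnext(9F) upstream. */
static void
transport_rput(rcvq_t *q)
{
    q->nmsgs++;
    q->wakeup_set = 1;    /* STR_WAKEUP_SET(): unblock sleeping reader */
    q->sigs_sent++;       /* STR_SENDSIG(): poll/SIGPOLL notification  */
}

/* Synchronous read entry point: return one datagram, or 0 if the
 * caller would block; when the queue drains, let the next read
 * block again. */
static int
transport_read(rcvq_t *q)
{
    if (q->nmsgs == 0)
        return (0);                /* caller would block */
    q->nmsgs--;
    if (q->nmsgs == 0)
        q->wakeup_set = 0;         /* STR_WAKEUP_CLEAR() */
    return (1);
}
```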

As part of the read operation, the transport module delivers data to the application by returning it from its read-side synchronous STREAMS entry point. In the case of loopback TCP, the synchronous STREAMS read entry point returns the entire contents (byte stream) of its receive queue to the stream head; any remaining data is re-enqueued at the stream head awaiting the next read. For UDP, the read entry point returns only one message (datagram) at a time.

STREAMS fallback

By default, direct transmission and read-side synchronous STREAMS optimizations are enabled for all UDP and loopback TCP sockets when sockfs is directly above the corresponding transport module. There are several cases that require these features to be disabled; when this happens, message exchange between sockfs and the transport module must be done through putnext(9F). The cases are as follows:
  • Intermediate Module - A module is configured to be autopushed at open time on top of the transport module via autopush(1M), or is I_PUSH'd onto a socket via ioctl(2).
  • Stream Conversion - The imaginary sockmod module is I_POP'd from a socket, converting it from a socket endpoint into a device stream.
(Note that the I_INSERT and I_REMOVE ioctls are not permitted on a socket endpoint, so no fallback is required to handle them.)

If a fallback is required, sockfs will notify the transport module that direct mode is disabled. The notification is sent down by the sockfs module in the form of an ioctl message, which indicates to the transport module that putnext(9F) must now be used to deliver data upstream. This allows for data to flow through the intermediate module and it provides for compatibility with device stream semantics.

5 IP

As mentioned before, all the transport layers have been merged into the IP module, which is fully multithreaded and acts as a pseudo device driver as well as a STREAMS module. The key change in IP was the removal of the IP client functionality and of the job of multiplexing the inbound packet stream. The new IP classifier (still part of the IP module) is responsible for classifying inbound packets to the correct connection instance. The IP module remains responsible for network layer protocol processing and for plumbing and managing the network interfaces.
Let's have a quick look at how plumbing of network interfaces, multipathing, and multicast work in the new stack.

Plumbing NICs

Plumbing is a long sequence of operations involving message exchanges between IP, ARP, and device drivers. Most set ioctls are typically involved in plumbing operations. A natural model is to serialize these ioctls, one per ill. For example, plumbing of hme0 and qfe0 can proceed in parallel without any interference, but the various set ioctls on hme0 will all be serialized.

Another possibility is to go finer-grained and serialize operations per ipif rather than per ill. This is beneficial only if many ipifs are hosted on an ill and the operations on different ipifs have no mutual interference. Yet another possibility is to fully multithread all ioctls using standard Solaris MT techniques, but this is needlessly complex and adds little value: it is hard to hold locks across the entire plumbing sequence, which involves waits and message exchanges with drivers and other modules, and little is gained in performance or functionality by allowing multiple simultaneous set ioctls on an ipif, since these are purely non-repetitive control operations. Moreover, broadcast ires are created on a per-ill rather than per-ipif basis, so trying to bring up more than one ipif simultaneously on an ill adds extra complexity to the broadcast ire creation logic.

On the other hand, serializing plumbing operations per ill lends itself easily to the existing IP code base. During the course of plumbing, IP exchanges messages with the device driver and ARP. The messages received from the underlying device driver are also handled exclusively in IP. This is convenient, since we can't hold standard mutex locks across a putnext while trying to provide mutual exclusion between write-side and read-side activities. Instead of the all-exclusive PERMOD syncq, the same effect is easily achieved using a per-ill serialization queue.
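The per-ill serialization queue behavior can be sketched as follows: exclusive operations on the same ill queue up behind one another, while different ills proceed independently. The "ipsq" naming echoes the IP serialization queue idea, but the fields and functions here are invented for illustration.

```c
#include <assert.h>

/* Toy per-ill serialization queue: an exclusive operation either
 * starts immediately or queues behind whatever is already running
 * on the same ill; separate ills are fully independent. */

typedef struct ipsq {
    int busy;        /* an exclusive operation is in progress */
    int queued;      /* operations waiting their turn         */
    int completed;   /* operations finished so far            */
} ipsq_t;

/* Try to start an exclusive operation; returns 1 if it runs now,
 * 0 if it was queued behind an in-progress operation. */
static int
ipsq_try_enter(ipsq_t *sq)
{
    if (sq->busy) {
        sq->queued++;
        return (0);
    }
    sq->busy = 1;
    return (1);
}

/* Finish the current operation; a queued one, if any, runs next. */
static void
ipsq_exit(ipsq_t *sq)
{
    sq->completed++;
    if (sq->queued > 0)
        sq->queued--;      /* next queued operation now owns the ipsq */
    else
        sq->busy = 0;
}
```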

IP Network MultiPathing (IPMP)

IPMP operations are all driven around the notion of an IPMP group. Failover and failback operations operate between two ills, usually part of the same IPMP group, moving the ipifs and ilms between the ills. This involves bringing down the source ill and may involve bringing up the destination ill. Bringing ills down or up affects broadcast ires, which need to be grouped per IPMP group to suppress duplicate received broadcast packets; thus broadcast ire manipulation affects all members of the IPMP group. Setting IFF_FAILED or IFF_STANDBY causes evaluation of all ills in the IPMP group and a regrouping of broadcast ires. Serializing IPMP operations per IPMP group therefore lends itself easily to the existing code base. An IPMP group includes both the IPv4 and IPv6 ills.


Multicast

Multicast joins operate on both the ilg and ilm structures. Multiple threads operating on an ipc (socket) trying to do multicast joins need to synchronize when operating on the ilg. Multiple threads potentially operating on different ipcs (socket endpoints) doing multicast joins could eventually end up manipulating the ilm simultaneously and need to synchronize their access to it. Both cases are amenable to standard Solaris MT techniques.

Considering all of the above - plumbing, IPMP and multicast - the common denominator is to serialize all the exclusive operations per IPMP group, or, if IPMP is not enabled, per phyint (e.g. the hme0 IPv4 and IPv6 ills taken together share a phyint). Of these, multicast has the potential for a higher degree of multithreading, but it has to coexist with the other exclusive operations; for example, we don't want a thread to create or delete an ilm while a failover operation is already in progress moving ilms between two ills. So the lowest common denominator is to serialize multicast joins per physical interface or IPMP group.


Sunay, this is excellent info! thx for putting it all together (and yes, I will still buy the book ;) ) . reg//ulf

Posted by Ulf Andreasson on November 15, 2005 at 03:30 AM PST #

It is worth mentioning that when the decision to move to STREAMS was made, it was not clear that IP would win in the marketplace. Another advantage of STREAMS was that it was fairly easy to have the software layers follow a network layer model. Of course, at the time the 7-layer ISO model was competing against the 4-layer IP model. For some reason IP won, yet we still try to describe it with 7 layers. Go figure. Anyway, in retrospect, it would have been easier to ignore anything but IP, like subsequently designed OSes have been able to do.
-- richard

Posted by Richard Elling on December 07, 2005 at 05:33 AM PST #
