Solaris Networking - The Magic Revealed (Part II)

Many of you have asked for details on Solaris 10 networking. The
great news is that I have finished writing the treatise on the subject,
which will become a new section in the Solaris Internals book
(http://solarisinternals.com/si/index.php) by Jim Mauro and
Richard McDougall (http://blogs.sun.com/roller/page/rmc).
In the meantime, I have used some excerpts to create a mini book (part
II) for the Networking community on OpenSolaris. Enjoy! As usual,
comments (good or bad) are welcome.

6. Solaris 10 Device Driver framework

Let's have a quick look at how network device drivers were implemented
before Solaris 10 and why they needed to change with the new Solaris 10
stack.

GLDv2 and Monolithic DLPI drivers (Solaris 9 and before)

Before Solaris 10, the network stack relied on DLPI providers, which
are normally implemented in one of two ways. The following illustration
(Fig. 5) shows a stack based on a so-called monolithic DLPI driver and a
stack based on a driver utilizing the Generic LAN Driver (GLDv2) module.

Fig. 5

The GLDv2 module essentially behaves as a library. The client still
talks to the driver instance bound to the device but the DLPI protocol
processing is handled by calling into the GLDv2 module, which will then
call back into the driver to access the hardware. Using the GLD module
has a clear advantage in that the driver writer need not re-implement
large amounts of mostly generic DLPI protocol processing. Layer two
(Data-Link) features such as 802.1q Virtual LANs (VLANs) can also be
implemented centrally in the GLD module allowing them to be leveraged
by all drivers. The architecture still poses a problem though when
considering how to implement a feature such as 802.3ad link aggregation
(a.k.a. trunking) where the one-to-one correspondence between network
interface and device is broken.

Both GLDv2 and monolithic drivers depend on DLPI messages and
communicate with the upper layers via the STREAMS framework. This
mechanism was not very effective for link aggregation or 10Gb NICs. With
the new stack, a better mechanism was needed which could ensure data
locality and allow the stack to control the device drivers at a much
finer granularity to deal with interrupts.

GLDv3 - A New

Solaris 10 introduced a new device driver framework called GLDv3
(internal name "project Nemo") along with the new stack. Most of the
major device drivers were ported to this framework, and all future and
10Gb device drivers will be based on it. The framework also provides a
STREAMS based DLPI layer for backward compatibility (to allow external,
non-IP modules to continue to work).

GLDv3 architecture virtualizes layer two of the network stack. There is
no longer a one-to-one correspondence between network interfaces and
devices. The illustration below (Fig. 6) shows multiple devices
registered with a MAC Services Module (MAC). It also shows two clients:
one traditional client that communicates via DLPI to a Data-Link Driver
(DLD) and one that is kernel based and simply makes direct function
calls into the Data-Link Services Module (DLS).

Fig. 6

GLDv3 Drivers

GLDv3 drivers are similar to GLD drivers. The driver must be linked
with a dependency on misc/mac and misc/dld. It must call
mac_register() with a pointer to an instance of the following structure
to register with the MAC module:

typedef struct mac {
    const char *m_ident;
    mac_ext_t *m_extp;
    struct mac_impl *m_impl;
    void *m_driver;
    dev_info_t *m_dip;
    uint_t m_port;
    mac_info_t m_info;
    mac_stat_t m_stat;
    mac_start_t m_start;
    mac_stop_t m_stop;
    mac_promisc_t m_promisc;
    mac_multicst_t m_multicst;
    mac_unicst_t m_unicst;
    mac_resources_t m_resources;
    mac_ioctl_t m_ioctl;
    mac_tx_t m_tx;
} mac_t;

This structure must persist for the lifetime of the registration, i.e.
it cannot be de-allocated until after mac_unregister() is called. A
GLDv3 driver's _init(9E) entry point is also required to call
mac_init_ops() before calling mod_install(9F), and its _fini(9E) entry
point is required to call mac_fini_ops() after calling mod_remove(9F).
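
The ordering requirement is easy to get wrong, so here is a minimal userland sketch of it. The stub_* functions are stand-ins I made up so the sequence can be run outside the kernel; they do not carry the real kernel signatures. Calling mac_fini_ops() when mod_install() fails is my assumption of the usual cleanup, and the text above does not spell it out.

```c
#include <string.h>

/* Userland stand-ins (assumptions) for the kernel routines named above. */
char call_log[128];      /* records the order in which the stubs are called */
int  fail_install;       /* set non-zero to simulate a mod_install() failure */

static void log_call(const char *name)
{
    strcat(call_log, name);
    strcat(call_log, ";");
}

static void stub_mac_init_ops(void) { log_call("mac_init_ops"); }
static void stub_mac_fini_ops(void) { log_call("mac_fini_ops"); }
static int  stub_mod_install(void)  { log_call("mod_install"); return fail_install ? -1 : 0; }
static int  stub_mod_remove(void)   { log_call("mod_remove"); return 0; }

/* Shape of a GLDv3 driver's _init(9E). */
int sketch_init(void)
{
    int rv;

    stub_mac_init_ops();          /* must precede mod_install() */
    rv = stub_mod_install();
    if (rv != 0)
        stub_mac_fini_ops();      /* assumed cleanup when the install fails */
    return rv;
}

/* Shape of a GLDv3 driver's _fini(9E). */
int sketch_fini(void)
{
    int rv = stub_mod_remove();

    if (rv == 0)
        stub_mac_fini_ops();      /* only after mod_remove() has succeeded */
    return rv;
}
```

The functions are named sketch_init and sketch_fini rather than _init and _fini only so that they do not clash with the C runtime in a userland build.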

The important members of this 'mac_t' structure are:
  • 'm_impl' - This is used by the MAC module to point to its private
    data. It must not be read or modified by a driver.

  • 'm_driver' - This field should be set by the driver to point at
    its private data. This value will be supplied as the first argument
    to the driver entry points.

  • 'm_dip' - This field must be set to the dev_info_t pointer of the
    driver instance calling mac_register().

  • 'm_stat' -
     typedef uint64_t (*mac_stat_t)(void *, mac_stat_t);

    This entry point is called to retrieve a value for one of the
    statistics defined in the mac_stat_t enumeration (below). All values
    should be stored and returned in 64-bit unsigned integers. Values will
    not be requested for statistics that the driver has not explicitly
    declared to be supported.

  • 'm_start' -
     typedef int (*mac_start_t)(void *);

    This entry point is called to bring the device out of the
    reset/quiesced state that it was in when the interface was registered.
    No packets will be submitted by the MAC module for
    transmission and no packets should be submitted by the driver for
    reception before this call is made. If this function succeeds then zero
    should be returned. If it fails then an appropriate errno value should
    be returned.
  • 'm_stop' -
     typedef void (*mac_stop_t)(void *);

    This entry point should stop the device and put it in a reset/quiesced
    state such that the interface can be unregistered. No packets will be
    submitted by the MAC for transmission once this call has been made and
    no packets should be submitted by the driver for reception once it has
    returned.
  • 'm_promisc' -
     typedef int (*mac_promisc_t)(void *, boolean_t);

    This entry point is used to set the promiscuity of the device. If the
    second argument is B_TRUE then the device should receive all packets on
    the media. If it is set to B_FALSE then only packets destined for the
    device's unicast address and the media broadcast address should be
    received.
  • 'm_multicst' -
     typedef int (*mac_multicst_t)(void *, boolean_t, const uint8_t *);

    This entry point is used to add and remove addresses to and from the
    set of multicast addresses for which the device will receive packets.
    If the second argument is B_TRUE then the address pointed to by the
    third argument should be added to the set. If the second argument is
    B_FALSE then the address pointed to by the third argument should be
    removed from the set.
  • 'm_unicst' -
     typedef int (*mac_unicst_t)(void *, const uint8_t *);

    This entry point is used to set a new device unicast address. Once this
    call is made then only packets with the new address and the media
    broadcast address should be received unless the device is in
    promiscuous mode.
  • 'm_resources' -
     typedef void (*mac_resources_t)(void *, boolean_t);

    This entry point is called to request that the driver register its
    individual receive resources or Rx rings.
  • 'm_tx' -
     typedef mblk_t *(*mac_tx_t)(void *, mblk_t *);

    This entry point is used to submit packets for transmission by the
    device. The second argument points to one or more packets contained in
    mblk_t structures. Fragments of the same packet will be linked together
    using the b_cont field. Separate packets will be linked by the b_next
    field in the leading fragment. Packets should be scheduled for
    transmission in the order in which they appear in the chain. Any
    remaining chain of packets that cannot be scheduled should be returned.
    If m_tx() does return packets that cannot be scheduled the driver must
    call mac_tx_update() when resources become available. If all packets
    are scheduled for transmission then NULL should be returned.
  • 'm_info' - This is an embedded structure defined as follows:
     typedef struct mac_info {
         uint_t mi_media;
         uint_t mi_sdu_min;
         uint_t mi_sdu_max;
         uint32_t mi_cksum;
         uint32_t mi_poll;
         boolean_t mi_stat[MAC_NSTAT];
         uint_t mi_addr_length;
         uint8_t mi_unicst_addr[MAXADDRLEN];
         uint8_t mi_brdcst_addr[MAXADDRLEN];
     } mac_info_t;

    mi_media is set to the media type; mi_sdu_min is the minimum payload
    size; mi_sdu_max is the maximum payload size; mi_cksum holds the
    device checksum capability flags; mi_poll indicates whether the driver
    supports polling; mi_addr_length is set to the length of the addresses
    used by the media; mi_unicst_addr is set to the unicast address of the
    device at the point at which mac_register() is called; mi_brdcst_addr
    is set to the broadcast address of the media; mi_stat is an array of
    boolean values indicating which of the statistics in the mac_stat_t
    enumeration the driver supports:
     typedef enum {


         MAC_NSTAT /* must be the last entry */
     } mac_stat_t;

    The macros MAC_MIB_SET(), MAC_ETHER_SET() and MAC_MII_SET() are
    provided to set all the values in each of the three groups respectively
    to B_TRUE.
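
To make the m_tx contract described above a little more concrete, here is a hedged userland sketch. The mblk_t here is a cut-down stand-in with only the two link fields, and sketch_dev_t, frag_count() and the "one descriptor per fragment" rule are assumptions of mine for illustration, not part of the framework.

```c
#include <stddef.h>

/* Minimal stand-in for the STREAMS mblk_t; only the link fields matter here. */
typedef struct mblk {
    struct mblk *b_next;   /* next packet in the chain */
    struct mblk *b_cont;   /* next fragment of the same packet */
} mblk_t;

/* Hypothetical per-device state: free Tx descriptors and a sent counter. */
typedef struct sketch_dev {
    unsigned tx_free;
    unsigned tx_sent;
} sketch_dev_t;

/* Count the fragments of one packet (one descriptor per fragment assumed). */
static unsigned frag_count(const mblk_t *mp)
{
    unsigned n = 0;

    for (; mp != NULL; mp = mp->b_cont)
        n++;
    return n;
}

/*
 * m_tx()-style entry point: schedule packets in chain order and return the
 * unsent remainder, or NULL if every packet was scheduled.
 */
mblk_t *sketch_tx(void *arg, mblk_t *chain)
{
    sketch_dev_t *dev = arg;

    while (chain != NULL) {
        unsigned need = frag_count(chain);
        mblk_t *next;

        if (need > dev->tx_free)
            break;                 /* out of descriptors: hand the rest back */

        next = chain->b_next;
        chain->b_next = NULL;      /* detach this packet from the chain */
        dev->tx_free -= need;
        dev->tx_sent++;
        chain = next;
    }
    return chain;   /* non-NULL means the driver owes a mac_tx_update() later */
}
```

The point of returning the remainder rather than queueing it in the driver is that the MAC layer keeps ownership of the backlog and simply retries once mac_tx_update() tells it that resources are available again.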

MAC Services (MAC) module

Some key Driver Support Functions:
  • 'mac_resource_add' -

    extern mac_resource_handle_t mac_resource_add(mac_t *, mac_resource_t *);

    Various members are defined as

    typedef void (*mac_blank_t)(void *, time_t, uint_t);
    typedef mblk_t *(*mac_poll_t)(void *, uint_t);

    typedef enum {
        MAC_RX_FIFO = 1
    } mac_resource_type_t;

    typedef struct mac_rx_fifo_s {
        mac_resource_type_t mrf_type; /* MAC_RX_FIFO */
        mac_blank_t mrf_blank;
        mac_poll_t mrf_poll;
        void *mrf_arg;
        time_t mrf_normal_blank_time;
        uint_t mrf_normal_pkt_cnt;
    } mac_rx_fifo_t;

    typedef union mac_resource_u {
        mac_resource_type_t mr_type;
        mac_rx_fifo_t mr_fifo;
    } mac_resource_t;

    This function should be called from the m_resources() entry point to
    register individual receive resources (commonly ring buffers of DMA
    descriptors) with the MAC module. The returned mac_resource_handle_t
    value should then be supplied in calls to mac_rx(). The second argument
    to mac_resource_add() specifies the resource being added. Resources are
    specified by the mac_resource_t structure. Currently only resources of
    type MAC_RX_FIFO are supported. MAC_RX_FIFO resources are described by
    the mac_rx_fifo_t structure.

    The mac_blank function is meant to be used by upper layers to control
    the interrupt rate of the device. Its first argument is the device
    context, which the driver supplies in mrf_arg.

    The other fields, mrf_normal_blank_time and mrf_normal_pkt_cnt, specify
    the default interrupt interval and packet count threshold,
    respectively. These parameters may be used as the second and third
    arguments to mac_blank when the upper layer wants the driver to revert
    to the default interrupt rate.

    The upper layer controls the interrupt rate by calling mac_blank with
    different arguments. The interrupt rate can be increased or decreased
    by passing a multiple of these values as the last two arguments of
    mac_blank. Setting these values to zero disables interrupts, and the
    NIC is then deemed to be in polling mode.

    mac_poll is the driver-supplied function used by the upper layer to
    retrieve a chain of packets (up to the maximum count specified by the
    second argument) from the Rx ring corresponding to the mrf_arg
    supplied earlier during mac_resource_add (and passed as the first
    argument to mac_poll).

  • 'mac_resource_update' -
     extern void mac_resource_update(mac_t *);

    Invoked by the driver when the available resources have changed.

  • 'mac_rx' -
     extern void mac_rx(mac_t *, mac_resource_handle_t, mblk_t *);

    This function should be called to deliver a chain of packets, contained
    in mblk_t structures, for reception. Fragments of the same packet
    should be linked together using the b_cont field. Separate packets
    should be linked using the b_next field of the leading fragment. If the
    packet chain was received by a registered resource then the appropriate
    mac_resource_handle_t value should be supplied as the second argument
    to the function. The protocol stack will use this value as a hint when
    trying to load-spread across multiple CPUs. It is assumed that packets
    belonging to the same flow will always be received by the same
    resource. If the resource is unknown or is unregistered then NULL
    should be passed as the second argument.
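
The blank and poll callbacks registered through mac_resource_add() are the heart of the interrupt versus polling switch, so a small userland sketch may help. It follows the mac_blank_t and mac_poll_t shapes shown above, but sketch_ring_t and its fields are hypothetical, mblk_t is a cut-down stand-in, and uint_t is spelled as unsigned so the code compiles outside the kernel.

```c
#include <stddef.h>
#include <time.h>

/* Minimal stand-in for the STREAMS mblk_t. */
typedef struct mblk {
    struct mblk *b_next;   /* next packet in the chain */
    struct mblk *b_cont;   /* next fragment of the same packet */
} mblk_t;

/* Hypothetical Rx ring: queued packets plus the current interrupt settings. */
typedef struct sketch_ring {
    mblk_t  *head;         /* packets waiting on the ring */
    time_t   blank_time;   /* current interrupt interval */
    unsigned pkt_cnt;      /* current packet count threshold */
    int      polling;      /* non-zero while interrupts are disabled */
} sketch_ring_t;

/* mac_blank_t-style callback: the upper layer adjusts the interrupt rate. */
void sketch_blank(void *arg, time_t blank_time, unsigned pkt_cnt)
{
    sketch_ring_t *ring = arg;

    ring->blank_time = blank_time;
    ring->pkt_cnt = pkt_cnt;
    /* Both values zero: interrupts off, the ring is in polling mode. */
    ring->polling = (blank_time == 0 && pkt_cnt == 0);
}

/* mac_poll_t-style callback: return a chain of at most max_pkts packets. */
mblk_t *sketch_poll(void *arg, unsigned max_pkts)
{
    sketch_ring_t *ring = arg;
    mblk_t *first = ring->head;
    mblk_t *last = NULL;

    while (ring->head != NULL && max_pkts-- > 0) {
        last = ring->head;
        ring->head = last->b_next;
    }
    if (last == NULL)
        return NULL;           /* ring empty, or max_pkts was zero */
    last->b_next = NULL;       /* terminate the chain handed to the caller */
    return first;
}
```

An upper layer would call sketch_blank(ring, 0, 0) to enter polling mode, drain the ring with sketch_poll(), and later restore the values a driver advertised in mrf_normal_blank_time and mrf_normal_pkt_cnt.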

Data-Link Services (DLS) Module

The DLS module provides a Data-Link Services interface analogous to
DLPI. The DLS interface is a kernel-level functional interface as
opposed to the STREAMS message based interface specified by DLPI. This
module provides the interfaces necessary for the upper layer to create
and destroy a data link service. It also provides the interfaces
necessary to plumb and unplumb the NIC. The plumbing and unplumbing of
a NIC for GLDv3 based device drivers is unchanged from the older GLDv2
or monolithic DLPI device drivers. The major changes are in the data
paths, which allow direct calls, packet chains and much finer grained
control over the NIC.

Data-Link Driver (DLD)

The Data-Link Driver provides a DLPI interface using the interfaces
provided by the DLS and MAC modules. The driver is configured using
IOCTLs passed to a control node. These IOCTLs create and destroy
separate DLPI provider nodes. This module deals with the DLPI messages
necessary to plumb/unplumb the NIC and provides backward compatibility
for the data path via STREAMS for non GLDv3 aware clients.

GLDv3 Link aggregation architecture

The GLDv3 framework provides support for Link Aggregation as defined by
IEEE 802.3ad. The key design principles while designing this facility
were:
  • Allow GLDv3 MAC drivers to be aggregated without code change
  • The performance of non-aggregated devices must be preserved
  • The performance of aggregated devices should be cumulative of the
    line rate for each member, i.e. minimal overheads due to aggregation
  • Support both manual configuration and Link Aggregation Control
    Protocol (LACP)
GLDv3 link aggregation is implemented by means of a pseudo driver called
'aggr'. It registers virtual ports corresponding to link aggregation
groups with the GLDv3 MAC layer. It uses the client interface provided
by the MAC layer to control and communicate with the aggregated MAC
ports, as illustrated below in Fig. 7. It also exports a pseudo 'aggr'
device driver which is used by the 'dladm' command to configure and
control the link aggregated interface. Once a MAC port is configured to
be part of a link aggregation group, it cannot be simultaneously
accessed by other MAC clients such as the DLS layer. The exclusive
access is enforced by the MAC layer. LACP is implemented by the 'aggr'
driver, which has access to the individual MAC ports or links.


Fig. 7

The GLDv3 aggr driver acts as a normal MAC module to the upper layer and
appears as a standard NIC interface which, once created with 'dladm',
can be configured and managed by 'ifconfig'. The 'aggr' module
registers each MAC port which is part of the aggregation with the upper
layer using the 'mac_resource_add' function such that the data paths
and interrupts from each MAC port can be independently managed by the
upper layers (see Section 8b). In short, the aggregated interface is
managed as a single interface with possibly one IP address, while the
data paths are managed as individual NICs by unique CPUs/squeues. This
provides aggregation capability to Solaris with near zero overheads and
linear scalability with respect to the number of MAC ports that are part
of the aggregation group.

Checksum offload

Solaris 10 improved the H/W checksum offload capability further to
improve overall performance for most applications. A 16-bit one's
complement checksum offload framework has existed in Solaris for some
time. It was originally added as a requirement for Zero Copy TCP/IP in
Solaris 2.6 but was never extended until recently to handle other
protocols. Solaris defines two classes of checksum offload:
  • Full - Complete checksum calculation in the hardware, including
    pseudo-header checksum computation for TCP and UDP packets. The
    hardware is assumed to have the ability to parse protocol headers.
  • Partial - "Dumb" one's complement checksum based on start, end
    and stuff offsets describing the span of the checksummed data and the
    location of the transport checksum field, with no pseudo-header
    calculation ability in the hardware.
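
As a rough illustration of the partial class, the sketch below does in software what such an adapter does with the start, end and stuff offsets. The function names are mine, and the sketch assumes that any pseudo-header contribution has already been placed in the checksum field by software (or is zero), since the hardware cannot compute it.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's complement sum of buf[start..end), folded to 16 bits. */
uint16_t ocsum(const uint8_t *buf, size_t start, size_t end)
{
    uint32_t sum = 0;
    size_t i;

    for (i = start; i + 1 < end; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < end)                          /* odd trailing byte, zero padded */
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)                     /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * "Dumb" transmit offload: sum the span, including whatever software left
 * in the checksum field, and store the complement at the stuff offset.
 */
void partial_cksum_tx(uint8_t *pkt, size_t start, size_t end, size_t stuff)
{
    uint16_t ck = (uint16_t)~ocsum(pkt, start, end);

    pkt[stuff] = (uint8_t)(ck >> 8);
    pkt[stuff + 1] = (uint8_t)(ck & 0xff);
}
```

With a zero seed in the checksum field, summing the same span again after the stuff step yields 0xffff, which is the usual receive-side validity check.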

Adding support for non-fragmented IPV4 cases (unicast or multicast) is
trivial for both transmit and receive, as most modern network adapters
support either class of checksum offload with minor differences in the
interface. The IPV6 cases are not as straightforward, because very few
full-checksum network adapters are capable of handling checksum
calculation for TCP/UDP packets over IPV6.

The fragmented IP cases have similar constraints. On transmit,
checksumming applies to the unfragmented datagram. In order for an
adapter to support checksum offload, it must be able to buffer all of
the IP fragments (or perform the fragmentation in hardware) before
finally calculating the checksum and sending the fragments over the
wire; until then, checksum offloading for outbound IP fragments cannot
be done. On the other hand, the receive fragment reassembly case is
more flexible since most full-checksum (and all partial-checksum)
network adapters are able to compute and provide the checksum value to
the network stack. During fragment reassembly stage, the network stack
can derive the checksum status of the unfragmented datagram by
combining the values altogether.

Things were simplified by not offloading the checksum when IP options
are present. For partial-checksum offload, certain adapters limit the
start offset to a width sufficient for simple IP packets. When the
length of the protocol headers exceeds this limit (due to the presence
of options), the start offset will wrap around, causing an incorrect
calculation. For full-checksum offload, none of the capable adapters is
able to correctly handle the IPV4 source routing option.

When transmit checksum offload takes place, the network stack will
associate eligible packets with ancillary information needed by the
driver to offload the checksum computation to hardware.

In the inbound case, the driver has full control over the packets that
get associated with hardware-calculated checksum values. Once a driver
advertises its capability via DL_CAPAB_HCKSUM, the network stack will
accept full and/or partial-checksum information for IPV4 and IPV6
packets. This process happens for both non-fragmented and fragmented
packets.
Fragmented packets will first need to go through the reassembly process
because checksum validation happens for fully reassembled datagrams.
During reassembly, the network stack combines the hardware-calculated
checksum value of each fragment.
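
Combining per-fragment values works because one's complement addition is associative. The sketch below shows the idea under one stated assumption: every fragment except the last has an even length (IP fragment payloads are multiples of 8 bytes), so the 16-bit words of the fragments line up with those of the whole datagram. The function names are hypothetical and this is not the stack's actual reassembly code.

```c
#include <stddef.h>
#include <stdint.h>

/* Folded 16-bit one's complement sum of a buffer. */
uint16_t cksum16(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)                          /* odd trailing byte, zero padded */
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Combine the hardware-provided sums of each fragment into one value. */
uint16_t combine_frag_sums(const uint16_t *sums, size_t nfrags)
{
    uint32_t total = 0;
    size_t i;

    for (i = 0; i < nfrags; i++)
        total += sums[i];
    while (total >> 16)                   /* fold the carries back in */
        total = (total & 0xffff) + (total >> 16);
    return (uint16_t)total;
}
```

The combined value equals the sum computed over the reassembled datagram in one pass, so the stack never has to touch the payload bytes again.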

'dladm' - New command for datalink administration

Over a period of time, 'ifconfig' has become severely overloaded trying
to manage various layers in the stack. Solaris 10 introduced the 'dladm'
command to manage the data link services and ease the burden on
'ifconfig'. The dladm command operates on three kinds of objects:
  • 'link' - Data-links, identified by a name
  • 'aggr' - Aggregations of network devices, identified by a key
  • 'dev' - Network devices, identified by concatenation of a driver
    name and an instance number.

The key of an aggregation must be an integer value between 1 and 65535.
Some devices do not support configurable data-links or aggregations.
The fixed data-links provided by such devices can be viewed using dladm
but not configured.

The GLDv3 framework allows users to select the outbound load balancing
policy across the members of an aggregation while configuring the
aggregation. The policy specifies which dev object is used to send
packets. A policy consists of a list of one or more layer specifiers
separated by commas. A layer specifier is one of the following:
  • L2 - Select outbound device according to source and destination
    MAC addresses of the packet.
  • L3 - Select outbound device according to source and destination
    IP addresses of the packet.
  • L4 - Select outbound device according to the upper layer protocol
    information contained in the packet. For TCP and UDP, this includes
    source and destination ports. For IPsec, this includes the SPI
    (Security Parameters Index.)

For example, to use upper layer protocol information, the following
policy can be used:

            -P L4

To use the source and destination MAC addresses as well as the source
and destination IP addresses, the following policy can be used:

            -P L2,L3
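
To give a feel for how a policy such as L2,L3 turns into a choice of outbound device, here is a hedged sketch. The POLICY_* bits, pkt_info_t and the multiply-and-add hash are all my own illustration; the real 'aggr' driver may hash quite differently. The only property that matters is that packets of the same flow always pick the same port.

```c
#include <stddef.h>
#include <stdint.h>

/* Policy bits, one per layer specifier (names are illustrative). */
#define POLICY_L2 0x1u
#define POLICY_L3 0x2u
#define POLICY_L4 0x4u

/* Hypothetical pre-parsed header fields of an outbound packet. */
typedef struct pkt_info {
    uint8_t  src_mac[6], dst_mac[6];
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
} pkt_info_t;

/* Simple multiply-and-add mixing step (an assumption, not the real hash). */
static uint32_t mix(uint32_t h, uint32_t v)
{
    return h * 31u + v;
}

/* Pick an outbound port index in [0, nports) from the selected layers. */
unsigned select_port(const pkt_info_t *p, unsigned policy, unsigned nports)
{
    uint32_t h = 0;
    size_t i;

    if (nports == 0)
        return 0;
    if (policy & POLICY_L2) {
        for (i = 0; i < 6; i++) {
            h = mix(h, p->src_mac[i]);
            h = mix(h, p->dst_mac[i]);
        }
    }
    if (policy & POLICY_L3) {
        h = mix(h, p->src_ip);
        h = mix(h, p->dst_ip);
    }
    if (policy & POLICY_L4) {
        h = mix(h, p->src_port);
        h = mix(h, p->dst_port);
    }
    return h % nports;
}
```

With '-P L2,L3' the ports are ignored, so all connections between the same two hosts leave on one device; '-P L4' spreads individual connections between the same pair of hosts across the aggregation.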

The framework also supports Link aggregation control protocol (LACP)
for GLDv3 based aggregations which can be controlled by 'dladm' via
the  'lacp-mode' and 'lacp-timer' sub commands. The 'lacp-mode'
can be set to 'off', 'active' or 'passive'.

When a new device is inserted into a system, a default non-VLAN
data-link will be created for the device during a reconfiguration boot
or DR. The configuration of all objects will persist across reboot.

In future, 'dladm' and its private file where all persistent
information is stored ('/etc/datalink.conf') will be used to manage
device specific parameters which are currently managed via 'ndd',
driver specific configuration files and /etc/system.

7. Tuning for

The Solaris 10 stack is tuned to give stellar out-of-the-box performance
irrespective of the H/W used. The secret lies in techniques like
dynamically switching between interrupt and polling mode: the NIC is
allowed to interrupt per packet when the load is manageable, which gives
very good latencies, and the stack switches to polling mode when the
load is very high, which gives better throughput and well bounded
latencies. The defaults are also carefully picked based on the H/W
configuration. For instance, the 'tcp_conn_hash_size' tunable was very
conservative pre Solaris 10. The default value of 512 hash buckets was
selected based on the lowest supported configuration (in terms of
memory). Solaris 10 looks at the free memory at boot time to choose the
value for 'tcp_conn_hash_size'. Similarly, when a connection is 'reaped'
from the time wait state, the memory associated with the connection
instance is not freed instantly (again based on the total system memory
available) but is instead put on a 'free_list'. When new connections
arrive within a given period, TCP tries to reuse memory from the
'free_list'; otherwise the 'free_list' is periodically cleaned up.

In spite of these features, it is sometimes necessary to tweak some
tunables to deal with extreme cases or specific workloads. We discuss
some tunables below that control the stack behaviour. Care should be
taken to understand their impact, otherwise the system might become
unstable. It is important to note that for the bulk of applications and
workloads, the defaults will give the best results.
  • 'ip_squeue_fanout' - Controls whether incoming connections from
    one NIC are fanned out across all CPUs. A value of 0 means incoming
    connections are assigned to the squeue attached to the interrupted CPU.
    A value of 1 means the connections are fanned out across all CPUs. The
    latter is required when the NIC is faster than the CPU (say a 10Gb NIC)
    and multiple CPUs need to service the NIC. Set via /etc/system by
    adding the following line
     set ip:ip_squeue_fanout=1

  • 'ip_squeue_bind' - Controls whether worker threads are bound to
    specific CPUs or not. When bound (default), they give better locality.
    The non default value (don't bind) is often chosen only when processor
    sets are to be created on the system. Unset via /etc/system by adding
    the following line
     set ip:ip_squeue_bind=0

  • 'tcp_squeue_wput' - Controls the write side squeue drain behavior.
    • 1 - Try to process your own packets but don't try to drain
      the squeue
    • 2 - Try to process your own packet as well as any queued
      packets

    The default value is 2 and can be changed via /etc/system by adding
     set ip:tcp_squeue_wput=1

    This value should be set to 1 when the number of CPUs is far greater
    than the number of active NICs and the platform has inherently higher
    memory latencies, where the chances of an application thread doing
    an squeue drain and getting pinned are high.

  • 'ip_squeue_wait' - Controls the amount of time in 'ms' a worker
    thread will wait before processing queued packets in the hope that
    an interrupt or writer thread will process the packet. For servers
    which see enough traffic, the default of 10ms is good, but for machines
    which see more interactive traffic (like desktops) where latency is an
    issue, the value should be set to 0 via /etc/system by adding
     set ip:ip_squeue_wait=0

  • In addition, some protocol level tuning, like changing the
    max_buf, high and low water marks, etc., is beneficial, especially on
    large memory systems.
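
The effect of 'ip_squeue_fanout' described in the list above can be pictured with a tiny sketch. The function and its round-robin choice are assumptions of mine; the text only says that connections are either kept on the interrupted CPU's squeue or fanned out across all CPUs, not how the target CPU is picked.

```c
/*
 * Choose the CPU whose squeue will own a new incoming connection.
 * fanout   - value of the ip_squeue_fanout tunable (0 or 1)
 * intr_cpu - CPU that took the NIC interrupt
 * ncpus    - number of CPUs available for fanout
 * next_cpu - round-robin cursor kept by the caller (hypothetical)
 */
unsigned pick_squeue(int fanout, unsigned intr_cpu, unsigned ncpus,
    unsigned *next_cpu)
{
    unsigned cpu;

    if (!fanout || ncpus == 0)
        return intr_cpu;            /* 0: stay on the interrupted CPU */

    cpu = *next_cpu % ncpus;        /* 1: spread across all CPUs */
    *next_cpu = (cpu + 1) % ncpus;
    return cpu;
}
```

Whichever CPU is picked, every later packet of that connection is processed on the same squeue, which is what preserves the data locality the stack is built around.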

8. Future

The future direction of the Solaris networking stack will continue to
focus on better vertical integration between layers, which will improve
locality and performance further. With the advent of chip
multithreading and multi core CPUs, the number of parallel execution
pipelines will continue to increase even on low end systems. A typical
2 CPU machine today is dual core, providing 4 execution pipelines, and
is soon going to have hyperthreading as well.

The NICs are also becoming advanced offering multiple interrupts via
MSI-X, small classification capabilities, multiple DMA channels, and
various stateless offloads like large segment offload etc.

Future work will continue to leverage these H/W trends, including
support for TCP offload engines, Remote direct memory access (RDMA),
and iSCSI. Some other specific things that are being worked on:
  • Network stack virtualization - With the industry wide trend of
    server consolidation and running multiple virtual machines on the same
    physical instance, it is important that the Solaris stack can be
    virtualized.
  • B/W Resource control - The same trend that is driving network
    virtualization is also driving the need to control the bandwidth usage
    of various applications and virtual machines on the same box
    efficiently.
  • Support for high performance 3rd party modules - The current
    Solaris 10 framework is still private to modules from Sun. STREAMS
    based modules are the only option for the ISVs and they miss the full
    potential of the new framework.
  • Forwarding performance - Work is being done to further improve
    the Solaris forwarding performance.
  • Network security with performance - The world is becoming
    increasingly complex and hostile. It is not possible to choose between
    performance and security anymore. Both are a requirement. Solaris was
    always very strong in security and Solaris 10 makes great strides in
    enabling security without sacrificing performance. Focus will continue
    on enhancing IPfilter performance and functionality and on a whole new
    approach to detecting Denial of service attacks and dealing with them.

9. Acknowledgments

Many Thanks to Thirumalai Srinivasan, Adi Masputra, Nicolas Droux, and
Eric Cheng for contributing parts of this text. Also thanks are due to
all the members of Solaris networking community for their help.
