Solaris Networking - The Magic Revealed (Part II)

Many of you have asked for details on Solaris 10 networking. The great news is that I have finished writing the treatise on the subject, which will become a new section in the Solaris Internals book by Jim Mauro and Richard McDougall. In the meanwhile, I have used some excerpts to create a mini book (part II) for the Networking community on OpenSolaris. Enjoy! As usual, comments (good or bad) are welcome.


6. Solaris 10 Device Driver framework

Let's have a quick look at how network device drivers were implemented prior to Solaris 10 and why they needed to change with the new Solaris 10 stack.

GLDv2 and Monolithic DLPI drivers (Solaris 9 and before)

Prior to Solaris 10, the network stack relied on DLPI providers, which were normally implemented in one of two ways. The following illustration (Fig. 5) shows a stack based on a so-called monolithic DLPI driver and a stack based on a driver utilizing the Generic LAN Driver (GLDv2) module.

Fig. 5


The GLDv2 module essentially behaves as a library. The client still talks to the driver instance bound to the device but the DLPI protocol processing is handled by calling into the GLDv2 module, which will then call back into the driver to access the hardware. Using the GLD module has a clear advantage in that the driver writer need not re-implement large amounts of mostly generic DLPI protocol processing. Layer two (Data-Link) features such as 802.1q Virtual LANs (VLANs) can also be implemented centrally in the GLD module allowing them to be leveraged by all drivers. The architecture still poses a problem though when considering how to implement a feature such as 802.3ad link aggregation (a.k.a. trunking) where the one-to-one correspondence between network interface and device is broken.

Both GLDv2 and monolithic drivers depend on DLPI messages and communicate with the upper layers via the STREAMS framework. This mechanism was not very effective for link aggregation or 10Gb NICs. With the new stack, a better mechanism was needed which could ensure data locality and allow the stack to control the device drivers at much finer granularity to deal with interrupts.

GLDv3 - A New Architecture

Solaris 10 introduced a new device driver framework called GLDv3 (internal name "project Nemo") along with the new stack. Most of the major device drivers were ported to this framework, and all future and 10Gb device drivers will be based on it. The framework also provides a STREAMS-based DLPI layer for backward compatibility (to allow external, non-IP modules to continue to work).

GLDv3 architecture virtualizes layer two of the network stack. There is no longer a one-to-one correspondence between network interfaces and devices. The illustration below (Fig. 6) shows multiple devices registered with a MAC Services Module (MAC). It also shows two clients: one traditional client that communicates via DLPI to a Data-Link Driver (DLD) and one that is kernel based and simply makes direct function calls into the Data-Link Services Module (DLS).
Fig. 6


GLDv3 Drivers

GLDv3 drivers are similar to GLD drivers. The driver must be linked with a dependency on misc/mac and misc/dld. It must call mac_register() with a pointer to an instance of the following structure to register with the MAC module:

typedef struct mac {
        const char        *m_ident;
        mac_ext_t         *m_extp;
        struct mac_impl   *m_impl;
        void              *m_driver;
        dev_info_t        *m_dip;
        uint_t            m_port;
        mac_info_t        m_info;
        mac_stat_t        m_stat;
        mac_start_t       m_start;
        mac_stop_t        m_stop;
        mac_promisc_t     m_promisc;
        mac_multicst_t    m_multicst;
        mac_unicst_t      m_unicst;
        mac_resources_t   m_resources;
        mac_ioctl_t       m_ioctl;
        mac_tx_t          m_tx;
} mac_t;

This structure must persist for the lifetime of the registration, i.e. it cannot be de-allocated until after mac_unregister() is called. A GLDv3 driver is also required to call mac_init_ops() from its _init(9E) entry point before calling mod_install(9F), and to call mac_fini_ops() after calling mod_remove(9F) from _fini(9E).

The important members of this 'mac_t' structure are:
  • 'm_impl' - This is used by the MAC module to point to its private data. It must not be read or modified by a driver.
  • 'm_driver' - This field should be set by the driver to point at its private data. This value will be supplied as the first argument to the driver entry points.
  • 'm_dip' - This field must be set to the dev_info_t pointer of the driver instance calling mac_register().
  • 'm_stat' -
     typedef uint64_t (*mac_stat_t)(void *, mac_stat_t);
    This entry point is called to retrieve a value for one of the statistics defined in the mac_stat_t enumeration (below). All values should be stored and returned in 64-bit unsigned integers. Values will not be requested for statistics that the driver has not explicitly declared to be supported.
  • 'm_start' -
     typedef int (*mac_start_t)(void *);
    This entry point is called to bring the device out of the reset/quiesced state that it was in when the interface was registered. No packets will be submitted by the MAC module for transmission and no packets should be submitted by the driver for reception before this call is made. If this function succeeds then zero should be returned. If it fails then an appropriate errno value should be returned.
  • 'm_stop' -
     typedef void (*mac_stop_t)(void *);
    This entry point should stop the device and put it in a reset/quiesced state such that the interface can be unregistered. No packets will be submitted by the MAC for transmission once this call has been made and no packets should be submitted by the driver for reception once it has completed.
  • 'm_promisc' -
     typedef int (*mac_promisc_t)(void *, boolean_t);
    This entry point is used to set the promiscuity of the device. If the second argument is B_TRUE then the device should receive all packets on the media. If it is set to B_FALSE then only packets destined for the device's unicast address and the media broadcast address should be received.
  • 'm_multicst' -
     typedef int (*mac_multicst_t)(void *, boolean_t, const uint8_t *);
    This entry point is used to add and remove addresses to and from the set of multicast addresses for which the device will receive packets. If the second argument is B_TRUE then the address pointed to by the third argument should be added to the set. If the second argument is B_FALSE then the address pointed to by the third argument should be removed.
  • 'm_unicst' -
     typedef int (*mac_unicst_t)(void *, const uint8_t *);
    This entry point is used to set a new device unicast address. Once this call is made then only packets with the new address and the media broadcast address should be received unless the device is in promiscuous mode.
  • 'm_resources' -
     typedef void (*mac_resources_t)(void *, boolean_t);
    This entry point is called to request that the driver register its individual receive resources or Rx rings.
  • 'm_tx' -
     typedef mblk_t *(*mac_tx_t)(void *, mblk_t *);
    This entry point is used to submit packets for transmission by the device. The second argument points to one or more packets contained in mblk_t structures. Fragments of the same packet will be linked together using the b_cont field. Separate packets will be linked by the b_next field in the leading fragment. Packets should be scheduled for transmission in the order in which they appear in the chain. Any remaining chain of packets that cannot be scheduled should be returned. If m_tx() does return packets that cannot be scheduled the driver must call mac_tx_update() when resources become available. If all packets are scheduled for transmission then NULL should be returned.
  • 'm_info' - This is an embedded structure defined as follows:
     typedef struct mac_info {
             uint_t     mi_media;
             uint_t     mi_sdu_min;
             uint_t     mi_sdu_max;
             uint32_t   mi_cksum;
             uint32_t   mi_poll;
             boolean_t  mi_stat[MAC_NSTAT];
             uint_t     mi_addr_length;
             uint8_t    mi_unicst_addr[MAXADDRLEN];
             uint8_t    mi_brdcst_addr[MAXADDRLEN];
     } mac_info_t;
    mi_media is set to the media type; mi_sdu_min is the minimum payload size; mi_sdu_max is the maximum payload size; mi_cksum holds the device checksum capability flags; mi_poll indicates whether the driver supports polling; mi_addr_length is set to the length of the addresses used by the media; mi_unicst_addr is set to the unicast address of the device at the point at which mac_register() is called; mi_brdcst_addr is set to the broadcast address of the media; mi_stat is an array of boolean values indicating which statistics of the mac_stat_t enumeration the device supports:
    typedef enum {
            ...
            MAC_NSTAT       /* must be the last entry */
    } mac_stat_t;

    The macros MAC_MIB_SET(), MAC_ETHER_SET() and MAC_MII_SET() are provided to set all the values in each of the three groups respectively to B_TRUE.

MAC Services (MAC) module

Some key Driver Support Functions:
  • 'mac_resource_add' -

    extern mac_resource_handle_t mac_resource_add(mac_t *, mac_resource_t *);
    The various types involved are defined as:

     typedef void (*mac_blank_t)(void *, time_t, uint_t);
     typedef mblk_t *(*mac_poll_t)(void *, uint_t);

     typedef enum {
             MAC_RX_FIFO = 1
     } mac_resource_type_t;

     typedef struct mac_rx_fifo_s {
             mac_resource_type_t mrf_type;          /* MAC_RX_FIFO */
             mac_blank_t         mrf_blank;
             mac_poll_t          mrf_poll;
             void                *mrf_arg;
             time_t              mrf_normal_blank_time;
             uint_t              mrf_normal_pkt_cnt;
     } mac_rx_fifo_t;

     typedef union mac_resource_u {
             mac_resource_type_t mr_type;
             mac_rx_fifo_t       mr_fifo;
     } mac_resource_t;
    This function should be called from the m_resources() entry point to register individual receive resources (commonly ring buffers of DMA descriptors) with the MAC module. The returned mac_resource_handle_t value should then be supplied in calls to mac_rx(). The second argument to mac_resource_add() specifies the resource being added. Resources are specified by the mac_resource_t structure. Currently only resources of type MAC_RX_FIFO are supported. MAC_RX_FIFO resources are described by the mac_rx_fifo_t structure.

    The mac_blank function is meant to be used by the upper layers to control the interrupt rate of the device. Its first argument is the device context, to be passed as the first argument to mac_blank.

    The other fields mrf_normal_blank_time and mrf_normal_pkt_cnt specify the default interrupt interval and packet count threshold, respectively. These parameters may be used as the second and third arguments to mac_blank when the upper layer wants the driver to revert to the default interrupt rate.

    The interrupt rate is controlled by the upper layer by calling mac_blank with different arguments. The interrupt rate can be increased or decreased by passing a multiple of these values as the last two arguments of mac_blank. Setting these values to zero disables interrupts, and the NIC is deemed to be in polling mode.

    mac_poll is a driver-supplied function used by the upper layer to retrieve a chain of packets (up to the max count specified by the second argument) from the Rx ring corresponding to the mrf_arg supplied earlier during mac_resource_add() (passed as the first argument to mac_poll).
  • 'mac_resource_update' -
     extern void mac_resource_update(mac_t \*);
    Invoked by the driver when the available resources have changed.
  • 'mac_rx' -
     extern void mac_rx(mac_t \*, mac_resource_handle_t, mblk_t \*);
    This function should be called to deliver a chain of packets, contained in mblk_t structures, for reception. Fragments of the same packet should be linked together using the b_cont field. Separate packets should be linked using the b_next field of the leading fragment. If the packet chain was received by a registered resource then the appropriate mac_resource_handle_t value should be supplied as the second argument to the function. The protocol stack will use this value as a hint when trying to load-spread across multiple CPUs. It is assumed that packets belonging to the same flow will always be received by the same resource. If the resource is unknown or is unregistered then NULL should be passed as the second argument.

Data-Link Services (DLS) Module

The DLS module provides the Data-Link Services interface, analogous to DLPI. The DLS interface is a kernel-level functional interface, as opposed to the STREAMS message-based interface specified by DLPI. This module provides the interfaces necessary for the upper layers to create and destroy a data-link service; it also provides the interfaces necessary to plumb and unplumb a NIC. The plumbing and unplumbing of NICs for GLDv3-based device drivers is unchanged from the older GLDv2 or monolithic DLPI device drivers. The major changes are in the data paths, which allow direct calls, packet chains, and much finer-grained control over the NIC.

Data-Link Driver (DLD)

The Data-Link Driver provides a DLPI interface using the services of the DLS and MAC modules. The driver is configured using IOCTLs passed to a control node; these IOCTLs create and destroy separate DLPI provider nodes. This module handles the DLPI messages necessary to plumb and unplumb the NIC, and provides backward compatibility in the data path via STREAMS for non-GLDv3-aware clients.

GLDv3 Link aggregation architecture

The GLDv3 framework provides support for link aggregation as defined by IEEE 802.3ad. The key design principles for this facility were:
  • Allow GLDv3 MAC drivers to be aggregated without code change
  • The performance of non-aggregated devices must be preserved
  • The performance of aggregated devices should approach the cumulative line rate of the members, i.e. minimal overhead due to aggregation
  • Support both manual configuration and Link Aggregation Control protocol (LACP)
GLDv3 link aggregation is implemented by means of a pseudo driver called 'aggr'. It registers virtual ports corresponding to link aggregation groups with the GLDv3 MAC layer, and uses the client interface provided by the MAC layer to control and communicate with the aggregated MAC ports, as illustrated below in Fig. 7. It also exports a pseudo 'aggr' device driver which is used by the 'dladm' command to configure and control the aggregated interface. Once a MAC port is configured to be part of a link aggregation group, it cannot be simultaneously accessed by other MAC clients such as the DLS layer; this exclusive access is enforced by the MAC layer. LACP is implemented by the 'aggr' driver, which has access to the individual MAC ports or links.

Fig. 7

The GLDv3 aggr driver acts as a normal MAC module to the upper layers and appears as a standard NIC interface which, once created with 'dladm', can be configured and managed by 'ifconfig'. The 'aggr' module registers each MAC port that is part of the aggregation with the upper layers using the 'mac_resource_add' function, such that the data paths and interrupts of each MAC port can be independently managed by the upper layers (see Section 8b). In short, the aggregated interface is managed as a single interface with possibly one IP address, while the data paths are managed as individual NICs by unique CPUs/squeues, providing Solaris with aggregation capability at near-zero overhead and linear scalability with respect to the number of MAC ports in the aggregation group.

Checksum offload

Solaris 10 further improved the H/W checksum offload capability to improve overall performance for most applications. A 16-bit one's complement checksum offload framework has existed in Solaris for some time; it was originally added as a requirement for zero-copy TCP/IP in Solaris 2.6, but was not extended to handle other protocols until recently. Solaris defines two classes of checksum offload:
  • Full - Complete checksum calculation in the hardware, including pseudo-header checksum computation for TCP and UDP packets. The hardware is assumed to have the ability to parse protocol headers.
  • Partial - "Dumb" one's complement checksum based on start, end and stuff offsets describing the span of the checksummed data and the location of the transport checksum field, with no pseudo-header calculation ability in the hardware.
Adding support for the non-fragmented IPV4 cases (unicast or multicast) is trivial for both transmit and receive, as most modern network adapters support either class of checksum offload with minor differences in the interface. The IPV6 cases are not as straightforward, because very few full-checksum network adapters are capable of handling checksum calculation for TCP/UDP packets over IPV6.

The fragmented IP cases have similar constraints. On transmit, checksumming applies to the unfragmented datagram. In order for an adapter to support checksum offload, it must be able to buffer all of the IP fragments (or perform the fragmentation in hardware) before finally calculating the checksum and sending the fragments over the wire; until then, checksum offloading for outbound IP fragments cannot be done. On the other hand, the receive fragment reassembly case is more flexible, since most full-checksum (and all partial-checksum) network adapters are able to compute and provide the checksum value to the network stack. During the fragment reassembly stage, the network stack can derive the checksum status of the unfragmented datagram by combining the per-fragment values.

Things were simplified by not offloading checksums when IP options are present. For partial-checksum offload, certain adapters limit the start offset to a width sufficient only for simple IP packets; when the length of the protocol headers exceeds this limit (due to the presence of options), the start offset wraps around, causing incorrect calculation. For full-checksum offload, none of the capable adapters correctly handles the IPV4 source routing option.

When transmit checksum offload takes place, the network stack will associate eligible packets with ancillary information needed by the driver to offload the checksum computation to hardware.

In the inbound case, the driver has full control over the packets that get associated with hardware-calculated checksum values. Once a driver advertises its capability via DL_CAPAB_HCKSUM, the network stack will accept full and/or partial-checksum information for IPV4 and IPV6 packets. This happens for both non-fragmented and fragmented payloads.

Fragmented packets will first need to go through the reassembly process because checksum validation happens for fully reassembled datagrams. During reassembly, the network stack combines the hardware-calculated checksum value of each fragment.

'dladm' - New command for datalink administration

Over time, 'ifconfig' has become severely overloaded trying to manage various layers of the stack. Solaris 10 introduced the 'dladm' command to manage the data-link services and ease the burden on 'ifconfig'. The dladm command operates on three kinds of objects:
  • 'link' - Data-links, identified by a name
  • 'aggr' - Aggregations of network devices, identified by a key
  • 'dev' - Network devices, identified by the concatenation of a driver name and an instance number.
The key of an aggregation must be an integer value between 1 and 65535. Some devices do not support configurable data-links or aggregations. The fixed data-links provided by such devices can be viewed using dladm but not configured.

The GLDv3 framework allows users to select the outbound load-balancing policy across the members of an aggregation while configuring it. The policy specifies which dev object is used to send packets. A policy consists of a list of one or more layer specifiers separated by commas. A layer specifier is one of the following:
  • L2 - Select outbound device according to source and destination MAC addresses of the packet.
  • L3 - Select outbound device according to source and destination IP addresses of the packet.
  • L4 - Select outbound device according to the upper layer protocol information contained in the packet. For TCP and UDP, this includes source and destination ports. For IPsec, this includes the SPI (Security Parameters Index.)
For example, to use upper layer protocol information, the following policy can be used:

            -P L4

To use the source and destination MAC addresses as well as the source and destination IP addresses, the following policy can be used:

            -P L2,L3

The framework also supports the Link Aggregation Control Protocol (LACP) for GLDv3-based aggregations, which can be controlled by 'dladm' via the 'lacp-mode' and 'lacp-timer' subcommands. The 'lacp-mode' can be set to 'off', 'active' or 'passive'.

When a new device is inserted into a system during a reconfiguration boot or DR, a default non-VLAN data-link is created for the device. The configuration of all objects persists across reboots.

In the future, 'dladm' and its private file where all persistent information is stored ('/etc/datalink.conf') will be used to manage device-specific parameters that are currently managed via 'ndd', driver-specific configuration files, and /etc/system.

7. Tuning for performance

The Solaris 10 stack is tuned to give stellar out-of-the-box performance irrespective of the H/W used. The secret lies in techniques like dynamically switching between interrupt and polling mode: allowing the NIC to interrupt per packet gives very good latencies when the load is manageable, while switching to polling mode gives better throughput and well-bounded latencies when the load is very high.

The defaults are also carefully picked based on the H/W configuration. For instance, the 'tcp_conn_hash_size' tunable was very conservative pre Solaris 10: the default value of 512 hash buckets was selected based on the lowest supported configuration (in terms of memory). Solaris 10 looks at the free memory at boot time to choose the value of 'tcp_conn_hash_size'. Similarly, when a connection is 'reaped' from the TIME_WAIT state, the memory associated with the connection instance is not freed instantly (again depending on the total system memory available) but is instead put on a 'free_list'. When new connections arrive in a given period, TCP tries to reuse memory from the 'free_list'; otherwise the 'free_list' is periodically cleaned up.
In spite of these features, it is sometimes necessary to tweak some tunables to deal with extreme cases or specific workloads. We discuss below some tunables that control the stack's behaviour. Care should be taken to understand their impact, otherwise the system might become unstable. It is important to note that for the bulk of applications and workloads, the defaults give the best results.
  • 'ip_squeue_fanout' - Controls whether incoming connections from one NIC are fanned out across all CPUs. A value of 0 means incoming connections are assigned to the squeue attached to the interrupted CPU. A value of 1 means the connections are fanned out across all CPUs. The latter is required when the NIC is faster than the CPU (say, a 10Gb NIC) and multiple CPUs need to service the NIC. Set via /etc/system by adding the following line:
     set ip:ip_squeue_fanout=1
  • 'ip_squeue_bind' - Controls whether worker threads are bound to specific CPUs. When bound (the default), they give better locality. The non-default value (don't bind) is usually chosen only when processor sets are to be created on the system. Unset via /etc/system by adding the following line:
     set ip:ip_squeue_bind=0
  • 'tcp_squeue_wput' - controls the write side squeue drain behavior.
    • 1 - Try to process your own packets but don't try to drain the squeue
    • 2 - Try to process your own packet as well as any queued packets.
    The default value is 2 and can be changed via /etc/system by adding
     set ip:tcp_squeue_wput=1
    This value should be set to 1 when the number of CPUs far exceeds the number of active NICs, or when the platform has inherently higher memory latencies and the chance of an application thread doing a squeue drain and getting pinned is high.

  • 'ip_squeue_wait' - Controls the amount of time in ms a worker thread will wait before processing queued packets, in the hope that an interrupt or writer thread will process the packet. For servers that see enough traffic, the default of 10 ms is good, but for machines that see more interactive traffic (like desktops), where latency is an issue, the value should be set to 0 via /etc/system by adding
     set ip:ip_squeue_wait=0
  • In addition, some protocol-level tuning, like changing max_buf and the high and low water marks, is beneficial, especially on large-memory systems.
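As an illustration only, several of these settings could be combined in a single /etc/system fragment for a many-CPU server driving a fast NIC; the values merely restate the examples above and must be validated against the target workload before use:

```
* Illustrative /etc/system fragment -- not a general recommendation
set ip:ip_squeue_fanout=1    * fan incoming connections out across CPUs
set ip:tcp_squeue_wput=1     * many CPUs, few NICs, higher memory latency
set ip:ip_squeue_wait=0      * favor latency over batching
```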

8. Future

The future direction of the Solaris networking stack will continue to build on better vertical integration between layers, which will further improve locality and performance. With the advent of chip multithreading and multi-core CPUs, the number of parallel execution pipelines will continue to increase even on low-end systems. A typical 2-CPU machine today is dual core, providing 4 execution pipelines, and will soon have hyperthreading as well.

NICs are also becoming more advanced, offering multiple interrupts via MSI-X, some classification capability, multiple DMA channels, and various stateless offloads such as large segment offload.

Future work will continue to leverage these H/W trends, including support for TCP offload engines, Remote Direct Memory Access (RDMA), and iSCSI. Some other specific things being worked on:
  • Network stack virtualization - With the industry-wide trend of server consolidation and running multiple virtual machines on the same physical instance, it is important that the Solaris stack can be virtualized efficiently.
  • B/W resource control - The same trend that is driving network virtualization is also driving the need to efficiently control the bandwidth usage of various applications and virtual machines on the same box.
  • Support for high-performance 3rd-party modules - The current Solaris 10 framework is still private to modules from Sun. STREAMS-based modules are the only option for ISVs, and they miss the full potential of the new framework.
  • Forwarding performance - Work is being done to further improve the Solaris forwarding performance.
  • Network security with performance - The world is becoming increasingly complex and hostile. It is no longer possible to choose between performance and security; both are a requirement. Solaris has always been very strong in security, and Solaris 10 makes great strides in enabling security without sacrificing performance. Focus will continue on enhancing IPfilter performance and functionality, and on a whole new approach to detecting denial-of-service attacks and dealing with them.

9. Acknowledgments

Many thanks to Thirumalai Srinivasan, Adi Masputra, Nicolas Droux, and Eric Cheng for contributing parts of this text. Thanks are also due to all the members of the Solaris networking community for their help.




Sunay Tripathi, Sun Distinguished Engineer, Solaris Core OS, writes a weblog on architecture for Solaris Networking Stack, GLDv3 (Nemo) framework, Crossbow Network Virtualization and related things

