Solaris Link Aggregations (1): The Architecture
By ndroux on Feb 14, 2006
One of my recent jobs was to architect and implement the Solaris Link Aggregation component of project GLDv3, a.k.a. Nemo, which has been part of OpenSolaris since day one, and recently shipped as part of Solaris 10 Update 1. (One of my other jobs was Technical Lead of GLDv3 itself for its integration into Solaris 11/Nevada, but that's a story worth a separate blog entry.)
Link aggregations consist of groups of Network Interface Cards (NICs) that provide increased bandwidth and higher availability. Network traffic is distributed among the members of an aggregation, and the failure of a single NIC should not affect the availability of the aggregation as long as there are other functional NICs in the same group.
Link aggregations have been successfully deployed on Solaris within Sun as well as in customers' production environments. Since there has been a lot of interest in this area, I decided to start a series of short articles introducing the concepts behind the feature, its implementation in Solaris, and how it can be (easily!) deployed. For now I will give an overview of the aggregation architecture, and will dig deeper into the details in future articles.
The following figure shows the aggregation driver and how it relates to other Nemo components. It does not show the data paths, which differ slightly from what is depicted here.
The MAC layer, which is part of GLDv3, is the central point of access to Network Interface Cards (NICs) in the kernel. At the top, it provides a client interface that allows clients to send and receive packets through NICs, and to configure, start, and stop them. At the bottom, the MAC layer exposes a provider interface used by NIC drivers to plug into the network stack. In the figure above, the client is the Data-Link Service (DLS), which provides SAP demultiplexing and VLAN support for the rest of the stack. The Data-Link Driver (DLD) provides a STREAMS interface between Nemo and DLPI consumers. We'll get into more details on DLS and DLD, which are also part of GLDv3, in a future article. Sunay also posted a general description of these components in his blog.
The core of the link aggregation feature is provided by the "aggr" kernel pseudo driver. This driver acts as both a MAC client and a MAC provider. The aggr driver implements a MAC provider interface so that it looks like any other MAC device, which allows us to manage aggregation devices as if they were a regular NIC from the rest of Solaris. We'll discuss the source of aggr in a future article.
Each aggregation of NICs is called an "aggregation group". Aggregation groups are uniquely identified by a key, an integer value unique on the system. We'll talk more about key values when we get into the administrative model. Note that there is a single pseudo instance of the aggr driver; each aggregation group is instantiated as a MAC port of that pseudo instance. Aggregations are managed (i.e. created, deleted, modified, queried) by the dladm(1M) command line utility, which communicates with the aggregation driver through a private control interface.
The aggregation driver is also a consumer of the MAC client interface, which it uses to control the individual NICs that make up each group. It starts and stops them, and sets the MAC address of each constituent NIC to the MAC address of the aggregation itself, which can be picked automatically from one of the constituent ports or set statically through dladm(1M). The aggregation driver also provides the send and receive routines used to move packets through the aggregation.
Another advantage of the MAC layer and its use by the aggregation driver is that any GLDv3 device can be part of an aggregation, without any special support. Currently bge (1 Gb/s Broadcom based), xge (10 Gb/s Neterion based), and e1000g (1 Gb/s Intel based) devices can be combined to form link aggregations.
Obviously there's a lot more to talk about. Stay tuned for future articles in this series, with more information on the administrative model, detailed design issues, and the data path. Thanks for listening...