Saturday Oct 01, 2011

See you at Oracle OpenWorld!

I will be speaking at Oracle OpenWorld 2011 next week in San Francisco, and I hope you will join me to learn more about Oracle Solaris 11, zones, network virtualization/Crossbow, I/O scalability, SPARC SuperCluster, and Solaris on Exadata and Exalogic. You can find me at the following sessions:
  • Session ID: 14646
    Session Title: Delivering the Near-Impossible: Around-the-Clock Global Secure Infrastructure
    Venue / Room: Moscone South - 252
    Date and Time: 10/3/11, 17:00 - 18:00

  • Session ID: 16242
    Session Title: Oracle Solaris Technical Panel with the Core Solaris Developers
    Venue / Room: Moscone South - 236
    Date and Time: 10/5/11, 13:15 - 14:15

  • Session ID: 16243
    Session Title: Cloud-Scale Networking with Oracle Solaris 11 Network Virtualization
    Venue / Room: Moscone South - 236
    Date and Time: 10/5/11, 17:00 - 18:00

You will also find me at the Meet the Experts area at the Moscone South DEMOgrounds for a one-on-one chat on Monday 3-4pm and Tuesday 9:45-11am. See the complete list of Oracle Solaris-related events at OOW here: http://bit.ly/oow11-solaris

You can also follow me on Twitter at @ndroux for my live coverage of OOW and up-to-the-minute updates.

See you there!

Monday Nov 15, 2010

Solaris 11 Express Released! On Crossbow, NUMA I/O, Exadata, and more…

After many years under development, Solaris 11 Express is now available from Oracle. This milestone makes the many features and improvements that we have been working on since Solaris 10 available with Oracle Premier Support! As the architect for Crossbow and NUMA I/O, I wanted to spend some time here to give you a quick introduction and my perspective on these features.

Solaris 11 Express includes Crossbow, which we integrated into Solaris a couple of years ago and have been steadily improving since then. Crossbow provides network virtualization and resource control designed into the core networking stack. This tight integration allows us to provide the best performance, leveraging advanced NIC hardware features and providing scalability on Oracle systems from the eight-socket Nehalem-based Sun Fire x4800 to the four-socket SPARC T3-4.

Management of Crossbow VNICs and QoS is also closely integrated with other Solaris administration tools and features. For example, VNICs and bandwidth limits can be easily managed with the common data-link management tool dladm(1M). Crossbow also takes the Solaris Zones virtualization architecture to the next level: each zone can have its own VNIC(s) and virtual link speed, and separation between zones is improved by automatically binding network kernel resources (threads and interrupts) to the CPUs belonging to a zone.
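
As a quick illustration, here is a minimal sketch of VNIC creation and bandwidth management using the Solaris 11 Express dladm(1M) syntax; the physical link name e1000g0 and the property values are placeholders:

# dladm create-vnic -l e1000g0 -p maxbw=300M vnic0    # VNIC capped at 300 Mbps
# dladm show-vnic vnic0                               # check its MAC address and speed
# dladm set-linkprop -p maxbw=500M vnic0              # the limit can be changed later

Such a VNIC can then be handed to a zone with an exclusive IP instance through zonecfg(1M).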

Crossbow features such as virtual switching, virtual NICs, bandwidth limits, and resource control can be combined with other networking features introduced by Solaris 11 Express (Load balancing, VRRP, bridging, revamped IP tunnels, improved observability) to provide the ideal environment to build fully virtual networks in a box for simulation, planning, debugging, and teaching. Thanks to these features and the high efficiency of Zones, Solaris 11 Express provides the foundation for an open networking platform.
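
To pick just one of those features as an example, a layer-2 bridge between two data links can be created with dladm(1M). This is only a sketch; the link and bridge names below are placeholders:

# dladm create-bridge -l e1000g0 -l e1000g1 labbridge
# dladm show-bridge labbridge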

While an integrated data path, QoS, resource control, and built-in scalability are key for performance, equally important is managing and placing these resources on large systems. The Sun Fire x4800 and Oracle SPARC T3-4, for instance, provide several processor sockets connected to multiple PCI Express I/O switches. On such large systems, the processors are divided into multiple NUMA (Non-Uniform Memory Access) nodes connected through a high-speed interconnect. I/O requests as well as DMA transfers to and from devices must be routed through the CPU interconnect, and the distance between devices and the CPUs used to process I/O requests must be kept to a minimum for best overall system scalability.

NUMA I/O is a new Solaris kernel framework used by other Solaris I/O subsystems (such as the network stack) to register their I/O resources (kernel threads, interrupts, and so on) and define, at a high level, the affinities between these resources. The NUMA I/O framework discovers the I/O topology of the machine and places these I/O resources on physical CPUs according to the affinities specified by the caller, as well as the NUMA and I/O hardware topology.
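
NUMA I/O itself is a kernel-internal framework rather than something you administer directly, but you can get a feel for the NUMA topology it works against with lgrpinfo(1). This is only an illustration, not a NUMA I/O interface:

# lgrpinfo -a        # locality groups (NUMA nodes), their CPUs, memory, and latencies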

The Oracle Exadata Database Machine running Solaris 11 Express depends heavily on NUMA I/O to achieve the best performance for InfiniBand RDSv3, the protocol used by the Exadata database compute nodes (Sun Fire x4800 in the case of the Oracle Exadata X2-8) to communicate with the Exadata Storage Servers. NUMA I/O is designed to be a common framework, and work is in progress to leverage it from other Solaris I/O subsystems.

Learn more about these features and the many other innovations provided by Solaris 11 Express, such as IPS (a new packaging system that redefines the OS software life cycle), ZFS crypto, a new installer, Zones improvements, and more, on the Solaris 11 Express site at oracle.com. There you will also find information on how to download Solaris 11 Express, details on the types of support available, documentation, and many other community resources.

Enjoy!

Tuesday May 26, 2009

Crossbow for Cloud Computing Architectures

The first phase of Crossbow was integrated in OpenSolaris last December. I recently posted a paper which shows how Crossbow technology such as virtual NICs (VNICs), virtual switches, Virtual Wires (vWires), and Virtual Network Machines (VNMs) can be used as a foundation to build isolated virtual networks for cloud computing architectures. The document is available on opensolaris.org. Please share your comments on the Crossbow mailing list at crossbow-discuss@opensolaris.org.

Thursday Feb 14, 2008

Private virtual networks for Solaris xVM and Zones with Crossbow

Virtualization is great: save money, save lab space, and save the planet. So far so good! But how do you connect these virtual machines, allocate them their share of the bandwidth, and how do they talk to the rest of the physical world? This is where the OpenSolaris Project Crossbow comes in. Today we are releasing a new pre-release snapshot of Crossbow, an exciting OpenSolaris project which enables network virtualization in Solaris, network bandwidth partitioning, and improved scalability of network traffic processing.

This new release of the project includes a new feature that allows you to build complete virtual networks that are isolated from the physical network. Virtual machines and Zones can be connected to these virtual networks and isolated from the rest of the physical world through a firewall, NAT, and so on. This is useful when you want to prototype a distributed application before deploying it on a physical network, or when you want to isolate and hide your virtual network.

This article shows how Crossbow can be used together with NAT to build a complete virtual network connecting multiple Zones within a Solaris host. The same technique applies to xVM Server x64 as well, since xVM uses Crossbow for its network virtualization needs. A detailed description of the Crossbow virtualization architecture can be found in my document here.

In this example, we will build the following network:

First we need to build our virtual network; this can be done very simply using Crossbow etherstubs. An etherstub is a pseudo Ethernet NIC which can be created with dladm(1M). VNICs can then be created on top of that etherstub, and the Crossbow MAC layer will implicitly create a virtual switch between all the VNICs sharing the same etherstub. In the following example we create an etherstub and three VNICs for our virtual network.


# dladm create-etherstub etherstub0
# dladm create-vnic -d etherstub0 vnic0
# dladm create-vnic -d etherstub0 vnic1
# dladm create-vnic -d etherstub0 vnic2

By default Crossbow will assign a random MAC address to the VNICs, as we can see from the following command:


# dladm show-vnic
LINK            OVER            SPEED       MACADDRESS         MACADDRTYPE
vnic0           etherstub0      0 Mbps      2:8:20:e7:1:6f     random
vnic1           etherstub0      0 Mbps      2:8:20:53:b4:9     random
vnic2           etherstub0      0 Mbps      2:8:20:47:b:9c     random

You could also assign a bandwidth limit to each VNIC by setting the maxbw property during VNIC creation. At this point we are done creating our virtual network. In the case of xVM, you would specify "etherstub0" instead of a physical NIC to connect the xVM domain to the virtual network. This would cause xVM to automatically create a VNIC on top of etherstub0 when booting the virtual machine. xVM configuration is described in the xVM configuration guide.
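
Going back to the maxbw property, a bandwidth-limited VNIC could be created along the following lines. This is only a sketch, and the exact property syntax in this pre-release Crossbow snapshot may differ slightly:

# dladm create-vnic -d etherstub0 -p maxbw=100 vnic3   # cap vnic3 at 100 Mbps at creation time
# dladm show-linkprop -p maxbw vnic3                   # verify the limit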

Now that we have our VNICs we can create our Zones. Zone test1 can be created as follows:


# zonecfg -z test1
test1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:test1> create
zonecfg:test1> set zonepath=/export/test1
zonecfg:test1> set ip-type=exclusive
zonecfg:test1> add inherit-pkg-dir
zonecfg:test1:inherit-pkg-dir> set dir=/opt
zonecfg:test1:inherit-pkg-dir> end
zonecfg:test1> add net
zonecfg:test1:net> set physical=vnic1
zonecfg:test1:net> end
zonecfg:test1> exit

Note that in this case the zone is assigned its own IP instance ("set ip-type=exclusive"). This allows the zone to configure its own VNIC which is connected to our virtual network. Now it's time to set up NAT between our external network and our internal virtual network. We'll be setting up NAT with IP Filter, which is part of OpenSolaris, based on the excellent NAT write-up by Rich Teer.

In our example the global zone will be used to interface our private virtual network with the physical network. The global zone connects to the physical network via eri0, and to the virtual private network via vnic0, as shown by the figure above. The eri0 interface is configured the usual way, and in our case its address is assigned using DHCP:


# ifconfig eri0
eri0: flags=201000843 mtu 1500 index 2
inet 192.168.84.24 netmask ffffff00 broadcast 192.168.84.255
ether 0:3:ba:94:65:f8

We will assign a static IP address to vnic0 in the global zone:


# ifconfig vnic0 plumb
# ifconfig vnic0 inet 192.168.0.1 up
# ifconfig vnic0
vnic0: flags=201100843 mtu 9000 index 6
inet 192.168.0.1 netmask ffffff00 broadcast 192.168.0.255
ether 2:8:20:e7:1:6f

Note that the usual configuration files (e.g. /etc/hostname.vnic0 in our case) must be populated for the configuration to persist across reboots. We must also enable IPv4 forwarding on the global zone. Run routeadm(1M) to display the current configuration, and if "IPv4 forwarding" is disabled, enable it with the following command:


# routeadm -u -e ipv4-forwarding

Then we can enable NAT on the eri0 interface. We're using a simple NAT configuration in /etc/ipf/ipnat.conf:


# cat /etc/ipf/ipnat.conf
map eri0 192.168.0.0/24 -> 0/32 portmap tcp/udp auto
map eri0 192.168.0.0/24 -> 0/32

We also need to enable IP filtering on our physical network-facing NIC eri0. We run "ipnat -l" to verify that our NAT rules have been enabled.


# svcadm enable network/ipfilter
# ipnat -l
List of active MAP/Redirect filters:
map eri0 192.168.0.0/24 -> 0.0.0.0/32 portmap tcp/udp auto
map eri0 192.168.0.0/24 -> 0.0.0.0/32

Now we can boot our zones:


# zoneadm -z test1 boot
# zoneadm -z test2 boot

Here I assigned the address 192.168.0.100 to vnic1, the VNIC assigned to zone test1:


# zlogin test1
[Connected to zone 'test1' pts/2]
...
# ifconfig vnic1
vnic1: flags=201000863 mtu 9000 index 2
inet 192.168.0.100 netmask ffffff00 broadcast 192.168.0.255
ether 2:8:20:53:b4:9
# netstat -nr

Routing Table: IPv4
  Destination           Gateway            Flags   Ref     Use    Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              192.168.0.1          UG        1          0
default              192.168.0.1          UG        1          0 vnic1
192.168.0.0          192.168.0.100        U         1          0 vnic1
127.0.0.1            127.0.0.1            UH        1          2 lo0

Routing Table: IPv6
  Destination/Mask             Gateway                    Flags Ref   Use    If
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      1       0 lo0

Note that the zone appears to be on a network and has what looks like a regular NIC with a regular MAC address. In reality, this zone is connected to a virtual network isolated from the physical network. From that non-global zone, we can now reach out to the physical network via NAT running in the global zone:


# ssh someuser@129.146.17.55
Password:
Last login: Tue Feb 12 13:35:03 2008 from somehost
...

From the global zone, we can query NAT to see the translations taking place:


# ipnat -l
List of active MAP/Redirect filters:
map eri0 192.168.0.0/24 -> 0.0.0.0/32 portmap tcp/udp auto
map eri0 192.168.0.0/24 -> 0.0.0.0/32

List of active sessions:
MAP 192.168.0.100 37153 <- -> 192.168.84.24 26333 [129.146.17.55 22]

Of course this is only the tip of the iceberg. You could run NAT from a non-global zone itself, deploy a virtual router on your virtual network, enable additional filtering rules, and so on. You are also not limited to a single virtual network: you can create multiple virtual networks within a host and route between them. We are exploring some of these possibilities as part of the Crossbow and Virtual Network Machines projects.
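
For example, a second isolated virtual network is simply another etherstub with its own VNICs (the names below are illustrative):

# dladm create-etherstub etherstub1
# dladm create-vnic -d etherstub1 vnic10
# dladm create-vnic -d etherstub1 vnic11

A zone (or the global zone) holding a VNIC on each etherstub can then route or filter traffic between the two virtual networks.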

Monday Apr 02, 2007

Virtual Switching in Solaris with Crossbow VNICs

Virtual NICs, also known as VNICs, are core components of project Crossbow. They allow physical NICs to be shared by multiple Zones or virtual machines such as Xen domains. VNICs appear to the rest of the system as regular NICs. VNICs can be assigned a subset of the hardware resources (interrupts, rings, etc) made available by the underlying hardware.

In order to provide connectivity between the multiple Zones or virtual machines sharing a single physical NIC, the VNIC layer also provides a data-path between the VNICs defined on top of the same underlying NIC. The VNICs sharing the same underlying NIC appear to be part of the same segment, i.e. connected to the same virtual switch. The virtual switch concept also allows fully virtual networks to be built within a machine.
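
As a rough sketch, using dladm(1M) syntax along the lines of the Crossbow pre-release snapshots (the bge0 device name is illustrative), two VNICs sharing the same physical NIC are implicitly connected by such a virtual switch:

# dladm create-vnic -d bge0 vnic1
# dladm create-vnic -d bge0 vnic2

Traffic between vnic1 and vnic2 is then switched within the MAC layer and never has to leave the host.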

A couple of days ago I posted a first draft design document describing the concept of virtual switches, how they are implemented by VNICs in Solaris, and how they can be used in practice.

Tuesday Sep 12, 2006

OpenSolaris virtualization technologies

It was great to be in Seattle last week to kick off a new year of Tech Days. The monthly events this year include a day dedicated to OpenSolaris, which is a great opportunity to catch up with OpenSolaris projects, hear from OpenSolaris engineers, and connect with the local community. Watch for our tour coming to a city near you.

In Seattle, I had the pleasure to give a presentation on OpenSolaris Virtualization Technologies covering a wide range of topics such as Zones, BrandZ, Xen, and CrossBow. My presentation is now available online (I'd like to thank Tim, Todd, Joost, David, and Nils for providing some material for this presentation.)

Other excellent talks at the OpenSolaris day included a presentation by Stephen on building and deploying OpenSolaris, a great presentation from Glenn on OpenSolaris security features, as well as an energized talk by Jim on OpenSolaris POD (Performance Observability and Debugging.) Teresa concluded this busy day by leading a BOF which led to the creation of a new OpenSolaris user group in Seattle.

Wednesday Aug 30, 2006

OpenSolaris Day at Tech Days in Seattle, September 6

The first stop of the OpenSolaris world tour as part of the Tech Days will be in Seattle on September 6th. Attendance to the OpenSolaris day is free, but you need to register very soon as space is limited.

I've been working on a presentation on the OpenSolaris Virtualization Technologies that I will be presenting in Seattle. See the agenda for the OpenSolaris day in particular or the whole event for more info.

Sunday Aug 27, 2006

CrossBow early access bits now available on OpenSolaris.org

I just spent most of last week driving the first release of CrossBow on OpenSolaris. The first part of my week was dedicated to some intense coding to putback about 15 fixes and features to the project gate on time for the release, and was followed by a marathon of documentation and process to get the bits available to the rest of the world [1].

CrossBow redefines network virtualization and resource control. Check out our project if you haven't already, try the goods, and join us at crossbow-discuss@opensolaris.org

[1] Thanks to Carol Gayo for setting up the initial download page, Michael Lim and Gopi Kunhappan for sanity checking the binaries before the release, as well as Dan Groves and Stephen Lau for their help building and posting the images.

Thursday Jun 08, 2006

Project Nemo now live on OpenSolaris

Project Nemo is now an official OpenSolaris project. Nemo, a.k.a. GLDv3, is a high-performance device driver framework which provides 802.3ad Link Aggregation and VLAN support for off-the-shelf device drivers. The following drivers are currently based on the Nemo framework: bge, e1000g, xge, nge, rge, and ixgb. Project Nemo was also a recipient of Sun's 2006 Chairman Award for Innovation.

Nemo also kicks-a** in the performance area with features such as direct function calls and packet chaining between IP and device driver, IP controlling the NIC and dynamically blanking interrupts, and support for stateless hardware offloading.

The project page will point you to the Nemo source code in Solaris, design documents, list of active and future projects, etc.

Wednesday Jun 07, 2006

Know your network data-links - dladm(1M) can help

Data-links in Solaris correspond to NICs, aggregations, or VLANs. The dladm(1M) command line tool provides an easy way to list them all, including some of their properties. dladm(1M) was introduced as part of Project Nemo in Solaris Nevada, Solaris 10 Update 1, and OpenSolaris.

There are two dladm subcommands that I want to cover here: the first is show-link, which shows the non-VLAN and VLAN data links, their MTU, and underlying device or aggregation. The second subcommand is show-dev, which shows the network devices present on the machine and their hardware state, such as link state, link speed, and duplex information.

The following example shows dladm in action on jurassic, a large server that hosts the home directories and email for the members of the OS development organization at Sun. Jurassic allows us to "eat our own dog food" by always running the latest Solaris development build in a production environment.


jurassic# dladm show-link
hme0            type: legacy    mtu: 1500       device: hme0
qfe0            type: legacy    mtu: 1500       device: qfe0
qfe1            type: legacy    mtu: 1500       device: qfe1
qfe2            type: legacy    mtu: 1500       device: qfe2
qfe3            type: legacy    mtu: 1500       device: qfe3
ce0             type: legacy    mtu: 1500       device: ce0
qfe16           type: legacy    mtu: 1500       device: qfe16
qfe17           type: legacy    mtu: 1500       device: qfe17
qfe18           type: legacy    mtu: 1500       device: qfe18
qfe19           type: legacy    mtu: 1500       device: qfe19
xge1            type: non-vlan  mtu: 1500       device: xge1
bge0            type: non-vlan  mtu: 1500       device: bge0
bge1            type: non-vlan  mtu: 1500       device: bge1
hme1            type: legacy    mtu: 1500       device: hme1
qfe8            type: legacy    mtu: 1500       device: qfe8
qfe9            type: legacy    mtu: 1500       device: qfe9
qfe10           type: legacy    mtu: 1500       device: qfe10
qfe11           type: legacy    mtu: 1500       device: qfe11
ce1             type: legacy    mtu: 1500       device: ce1
bge2            type: non-vlan  mtu: 1500       device: bge2
bge3            type: non-vlan  mtu: 1500       device: bge3
qfe20           type: legacy    mtu: 1500       device: qfe20
qfe21           type: legacy    mtu: 1500       device: qfe21
qfe22           type: legacy    mtu: 1500       device: qfe22
qfe23           type: legacy    mtu: 1500       device: qfe23
aggr1           type: non-vlan  mtu: 1500       aggregation: key 1
aggr226001      type: vlan 226  mtu: 1500       aggregation: key 1
aggr224001      type: vlan 224  mtu: 1500       aggregation: key 1
aggr108001      type: vlan 108  mtu: 1500       aggregation: key 1
aggr68001       type: vlan 68   mtu: 1500       aggregation: key 1
aggr58001       type: vlan 58   mtu: 1500       aggregation: key 1
aggr106001      type: vlan 106  mtu: 1500       aggregation: key 1
aggr56001       type: vlan 56   mtu: 1500       aggregation: key 1
aggr104001      type: vlan 104  mtu: 1500       aggregation: key 1
aggr228001      type: vlan 228  mtu: 1500       aggregation: key 1
aggr2           type: non-vlan  mtu: 1500       aggregation: key 2
aggr226002      type: vlan 226  mtu: 1500       aggregation: key 2
aggr224002      type: vlan 224  mtu: 1500       aggregation: key 2
aggr108002      type: vlan 108  mtu: 1500       aggregation: key 2
aggr68002       type: vlan 68   mtu: 1500       aggregation: key 2
aggr58002       type: vlan 58   mtu: 1500       aggregation: key 2
aggr106002      type: vlan 106  mtu: 1500       aggregation: key 2
aggr56002       type: vlan 56   mtu: 1500       aggregation: key 2
aggr104002      type: vlan 104  mtu: 1500       aggregation: key 2
aggr228002      type: vlan 228  mtu: 1500       aggregation: key 2

From the example above, you can see that jurassic has a bunch of data-links corresponding to devices of various kinds, some of which are combined to form two aggregations, on top of which nine VLANs are defined. To further improve high availability, these VLANs are grouped to form nine IPMP groups spanning two separate physical switches. IPMP groups are not yet listed by dladm show-link, but will be after the IPMP rearchitecture completes.
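
As a side note, the VLAN link names above follow the usual Solaris VLAN naming convention: the PPA encodes the VLAN ID and the instance of the underlying link (here the aggregation key) as VLAN_ID x 1000 + instance. For example, VLAN 226 over the aggregation with key 1 is the data-link aggr226001, which can be plumbed like any other interface (illustrative):

# ifconfig aggr226001 plumb      # 226 * 1000 + 1 = 226001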

show-dev is a related dladm(1M) subcommand which lists only physical NICs along with their physical link state. The following shows dladm show-dev in action on jurassic.


jurassic# dladm show-dev
hme0            link: unknown   speed: 0     Mbps       duplex: unknown
qfe0            link: unknown   speed: 0     Mbps       duplex: unknown
qfe1            link: unknown   speed: 0     Mbps       duplex: unknown
qfe2            link: unknown   speed: 0     Mbps       duplex: unknown
qfe3            link: unknown   speed: 0     Mbps       duplex: unknown
ce0             link: unknown   speed: 1000  Mbps       duplex: full
qfe16           link: unknown   speed: 0     Mbps       duplex: unknown
qfe17           link: unknown   speed: 0     Mbps       duplex: unknown
qfe18           link: unknown   speed: 0     Mbps       duplex: unknown
qfe19           link: unknown   speed: 0     Mbps       duplex: unknown
xge1            link: unknown   speed: 10000 Mbps       duplex: full
bge0            link: up        speed: 1000  Mbps       duplex: full
bge1            link: up        speed: 1000  Mbps       duplex: full
hme1            link: unknown   speed: 0     Mbps       duplex: unknown
qfe8            link: unknown   speed: 0     Mbps       duplex: unknown
qfe9            link: unknown   speed: 0     Mbps       duplex: unknown
qfe10           link: unknown   speed: 0     Mbps       duplex: unknown
qfe11           link: unknown   speed: 0     Mbps       duplex: unknown
ce1             link: unknown   speed: 1000  Mbps       duplex: full
bge2            link: up        speed: 1000  Mbps       duplex: full
bge3            link: up        speed: 1000  Mbps       duplex: full
qfe20           link: unknown   speed: 0     Mbps       duplex: unknown
qfe21           link: unknown   speed: 0     Mbps       duplex: unknown
qfe22           link: unknown   speed: 0     Mbps       duplex: unknown
qfe23           link: unknown   speed: 0     Mbps       duplex: unknown

So there you have it, dladm away! Also check out the dladm(1M) man page for a full description of these subcommands and more.

Wednesday May 03, 2006

Link Aggregation vs IP Multipathing

We introduced Link Aggregation capabilities (based on IEEE 802.3ad) in Solaris as part of the Nemo project (a.k.a GLDv3). I described the Solaris Link Aggregation architecture in a previous blog entry, and it is also documented on docs.sun.com and by the dladm(1M) man page. Link aggregations provide high availability and higher throughput by aggregating multiple interfaces at the MAC layer. IP Multipathing (IPMP) provides features such as higher availability at the IP layer. It is described on docs.sun.com.

Both IPMP and Link Aggregation are based on the grouping of network interfaces, and some of their features overlap, such as higher availability. These technologies are however implemented at different layers of the stack, and have different strengths and weaknesses. The list below is my attempt to compare and contrast Link Aggregation and IPMP.

I should disclose that I was responsible for designing and implementing Link Aggregation in Solaris, but I made every effort to keep the list below balanced and neutral :-)

  • Link aggregations are created and managed through dladm(1M). Once created, they behave like any other physical NIC to the rest of the system. The grouping of interfaces for IPMP is done using ifconfig.
  • Both link aggregations and IPMP support link-based failure detection, i.e. the health of an interface is determined from the state of the link reported by the driver.
  • The equivalent of IPMP probe-based failure detection is done with LACP (Link Aggregation Control Protocol) in the case of link aggregations. LACP is lighter weight than the ICMP-based probe failure detection implemented by IPMP, since it is done at the MAC layer and doesn't require test addresses.
  • Link aggregations currently don't allow you to have separate standby interfaces that are not used until a failure is detected. If a link is part of an aggregation, it will be used to send and receive traffic if it is healthy. We're looking at providing that feature as part of an RFE.
  • Aggregating links from one host to two different switches through a single aggregation is not supported. Link aggregations form trunks between two end-points, which can be hosts or switches, as per IEEE 802.3ad. They are implemented at the MAC layer and require all the constituent interfaces of an aggregation to use the same MAC address. Since IPMP is implemented at the network layer, it doesn't have that limitation.
  • Link aggregations currently require the underlying driver to use Nemo/GLDv3 interfaces (the list currently includes bge, e1000g, xge, nge, rge, ixgb).
  • Link aggregations require hardware support, i.e. if an aggregation is created on Solaris, the corresponding ports on the switches need to be aggregated as well. Some of this configuration can be automated using LACP. IPMP does not require such configuration.
  • IPMP is link-layer agnostic, link aggregations are Ethernet-specific.
  • Link aggregations provide finer-grained control over the load-balancing policy used to spread outbound traffic across aggregated links, e.g. load balancing on transport protocol port numbers vs. MAC addresses. dladm(1M) also makes it easy to observe the inbound and outbound distribution of traffic over the constituent NICs; see the sketch after this list.
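
As a sketch of that last point, assuming an aggregation with key 1 already exists:

# dladm modify-aggr -P L4 1      # hash outbound traffic on transport (L4) port numbers
# dladm show-aggr -s -i 5 1      # observe per-port traffic statistics every 5 seconds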

It's also worth pointing out that IPMP can be deployed on top of aggregations to maximize performance and availability. Of course these two technologies are still being actively developed, and some of the shortcomings of either technology listed above will be addressed with time. IPMP for instance is currently undergoing a rearchitecture which is described in this document on OpenSolaris.org. Several improvements of Nemo are also in progress, such as making Link Aggregations available to any device on the system, as described by the Nemo Unification design document. Thanks to Peter Memishian for checking my facts on IPMP and for his contributions.

Thursday Apr 13, 2006

Crossbow NIC Virtualization

We recently opened Project Crossbow on OpenSolaris.org. Crossbow enables network virtualization and resource control on Solaris. Virtual NICs (Virtual Network Interface Cards, or VNICs) are major components of the Crossbow architecture. Since I'm responsible for this part of the project, I wanted to give you a brief introduction to VNICs, and how they are used.

Virtualization in general is a very attractive proposition, and widely used today to consolidate hosts and services. Solaris Zones, which has been available since Solaris 10, is one method by which a Solaris instance can be partitioned into multiple runtimes sharing the same Solaris kernel. Xen is another virtualization project which allows multiple virtual machines, consisting of their own (possibly different) kernel, to run on the same hardware host.

VNICs allow carving up physical NICs, or aggregations of NICs, to form virtual NICs. These VNICs behave just like any other network card for the rest of the system. They have MAC addresses, can be plumbed and configured from ifconfig, etc. VNICs can be assigned to zones or virtual machines (for example Xen domains) running on the machine.
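
For example, once a VNIC exists it is plumbed and configured like any other data-link; the interface name and address below are purely illustrative:

# ifconfig vnic1 plumb
# ifconfig vnic1 inet 10.0.0.2 netmask 255.255.255.0 up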

One of the benefits of Crossbow is that VNICs can be assigned their own bandwidth limits or guarantees. These limits effectively allow assigning a part of the underlying NIC bandwidth to zones or virtual machines. The enforcement of that limit is done by the squeues which are assigned to the VNIC.

When the physical NIC provides hardware classification capabilities and multiple receive rings, receive rings are assigned to VNICs directly, and the classifier is programmed so that traffic received for a given VNIC lands directly on the hardware rings assigned to that VNIC. This allows VNICs to be implemented without a performance penalty. When the underlying physical NIC doesn't provide these hardware capabilities, the MAC layer on top of the NIC driver performs software classification to the VNICs through software rings.

The following figure shows two VNICs defined on top of the same physical NIC, and assigned to two separate zones.

The figure above also shows another VNIC option, which consists of assigning multiple hardware rings to a single VNIC. Some of these rings can then be assigned to separate services or protocols, and be given different bandwidth or priority requirements. In the example above, zone1 assigns its own ring to HTTPS traffic, for which it can set a higher bandwidth.

As you can see, VNICs are a powerful construct and a pillar of Crossbow. If you are interested in project Crossbow, please read more about it on our OpenSolaris project page. Our discussion forum awaits your comments or questions.

Monday Mar 13, 2006

Project Nemo design document now available

By popular demand the Project Nemo design document is now available on OpenSolaris.org.

Project Nemo aims at improving performance and accelerating the development and adoption of high-performance network drivers in Solaris. Project Nemo was integrated in Solaris Nevada build 12 and Solaris 10 Update 1, and is still being improved to support additional advanced networking hardware features, broaden support for different device types and legacy drivers, etc.

We're also hoping to make Project Nemo an official OpenSolaris project; watch this space for further news on that subject.

Wednesday Feb 22, 2006

Solaris Link Aggregations (2): Configuration

In my previous entry, I described the architecture of the Solaris Link Aggregations. Today, we'll take a quick look at how easily this feature can be used to create aggregations of NICs with higher bandwidth and availability.

Currently, only devices that plug into the GLDv3 (a.k.a. Nemo) framework can be aggregated. Out of the box, this currently includes bge (1 Gb/s Broadcom based), e1000g (1 Gb/s Intel based), and xge (10 Gb/s Neterion based). More drivers are being ported from DLPI or GLDv2 to GLDv3, and the Nemo Unification project currently underway and led by Cathy is going to provide a shim layer that will allow all DLPI-based drivers to plug into the GLDv3 framework.

Suppose your machine has four Gigabit Ethernet NICs, bge0-3, that you want to aggregate (our newest servers such as the Niagara-based Sun Fire T2000 and T1000, as well as our AMD Opteron-based Sun Fire X4100 and X4200 servers already come with four on-board gigabit ethernet ports, and it's also possible to add single, dual, or even quad gigabit-ethernet adapters to a system.) To aggregate these network interfaces, you simply run the following command:

# dladm create-aggr -d bge0 -d bge1 -d bge2 -d bge3 1

That's it! You now have a 4 Gb/s pipe to your machine (yes, it loves to scale, I'll show you in a future article). The previous command caused a new device "aggr1" to be created, which you can plumb and configure with ifconfig(1M) like any other device, for example:

# ifconfig aggr1 plumb
# ifconfig aggr1 inet 192.168.1.1 up

All the aggregation configuration information is automatically persistent across reboots, so you don't have to edit any file other than the usual /etc/hostname.aggr1 entry.

The full set of options for the create-aggr subcommand is described in detail in the dladm(1M) man page. Some of these options allow enabling LACP, changing the traffic distribution policy, setting an explicit MAC address (by default, the aggregation driver uses the address of one of the constituent ports), etc.
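
For instance, here is a sketch of an aggregation created with LACP in active mode, an L3/L4 distribution policy, and a fixed MAC address; the address and key value are illustrative:

# dladm create-aggr -l active -P L3,L4 -u 0:e0:81:27:e4:30 -d bge0 -d bge1 2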

Note that the last argument of the create-aggr subcommand above corresponds to the key of the aggregation, which you can pick but must be unique on your machine (a future version of dladm will pick one for you.) That key is used as the PPA of the aggregation data-link that can be configured using ifconfig(1M). In the example above, the specified key value was 1, so the data-link name is aggr1.

Another dladm(1M) subcommand you may find useful is show-aggr, which allows you to display the status of an aggregation and its constituent ports, as well as traffic distribution statistics.
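
For example, using the aggregation with key 1 created earlier:

# dladm show-aggr 1              # status of the aggregation and its constituent ports
# dladm show-aggr -s -i 2 1      # per-port traffic statistics, refreshed every 2 seconds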

Tuesday Feb 14, 2006

Solaris Link Aggregations (1): The Architecture

One of my recent jobs was to architect and implement the Solaris Link Aggregation component of project GLDv3, a.k.a. Nemo, which has been part of OpenSolaris since day one, and recently shipped as part of Solaris 10 Update 1. (One of my other jobs was Technical Lead of GLDv3 itself for its integration into Solaris 11/Nevada, but that's a story worth a separate blog entry.)

Link aggregations consist of groups of Network Interface Cards (NICs) that provide increased bandwidth and higher availability. Network traffic is distributed among the members of an aggregation, and the failure of a single NIC should not affect the availability of the aggregation as long as there are other functional NICs in the same group.

Link aggregations have been successfully deployed on Solaris within Sun as well as in customer production environments. Since there has been a lot of interest in this area, I decided to start a series of short articles introducing the concepts behind this feature, its implementation in Solaris, and how it can be (easily!) deployed. For now I will give an overview of the aggregation architecture, and will dig deeper into the details in future articles.

The following figure represents the aggregation driver and how it relates to other Nemo components. It does not represent the data paths, which are slightly different from what is represented there.

The MAC layer, which is part of GLDv3, is the central point of access to Network Interface Cards (NICs) in the kernel. At the top, it provides a client interface that allows a client to send and receive packets to and from NICs, as well as configure, stop, and start NICs. At the bottom, the MAC layer provides a provider interface which is used by NIC drivers to interface with the network stack. In the figure above, the client is the Data-Link Service (DLS) which provides SAP demultiplexing and VLAN support for the rest of the stack. The Data-Link Driver (DLD) provides a STREAMS interface between Nemo and DLPI consumers. We'll get into more details on DLS and DLD, which are also part of GLDv3, in a future article. Sunay also posted a general description of these components in his blog.

The core of the link aggregation feature is provided by the "aggr" kernel pseudo driver. This driver acts as both a MAC client and a MAC provider. The aggr driver implements a MAC provider interface so that it looks like any other MAC device, which allows us to manage aggregation devices as if they were a regular NIC from the rest of Solaris. We'll discuss the source of aggr in a future article.

Each aggregation of NICs is called an "aggregation group". Aggregation groups are uniquely identified by a key, an integer value unique on the system. We'll talk more about key values when we get into the administrative model. Note there is only one pseudo instance of the aggr driver. Each aggregation group is instantiated as a MAC port of that pseudo instance. Aggregations are managed (i.e. created, deleted, modified, queried) by the dladm(1M) command line utility, which communicates with the aggregation driver through a private control interface.

The aggregation group is also a consumer of the MAC client interface, which it uses to control the individual NICs that are part of aggregation groups. The aggregation driver controls (starts/stops/etc) individual NICs, and sets the MAC address of the individual NICs according to the MAC address of the aggregation itself, which can be automatically picked from one of the constituent ports, or set statically through dladm(1M). The aggregation driver also specifies the send and receive routines needed for the transmission of packets through the aggregation.

Another advantage of the MAC layer and its use by the aggregation driver is that any GLDv3 driver can be part of an aggregation without any special support. Currently bge (1 Gb/s Broadcom based), xge (10 Gb/s Neterion based), and e1000g (1 Gb/s Intel based) devices can be combined to form link aggregations.

Obviously there's a lot more to talk about. Stay tuned for future articles of this series with more information on the administration model, detailed design issues, and the data-path. Thanks for listening...
