Wednesday Feb 11, 2009

flow control with multiple Tx rings

Phase 1 of Project Crossbow has been putback. As a core member of the project team plus the project gate keeper, I was involved in practically every aspect of the project.

One of the areas that I worked on was designing and implementing the Tx data path (at layer 2). With 10Gbps NICs (and some 1Gbps NICs too) featuring multiple Tx rings, the Tx side needed to take advantage of them: it had to fan out outgoing traffic among the Tx rings and, at the same time, handle a flow control condition arising on a specific Tx ring without affecting the other Tx rings. To accomplish this, the MAC client API on the send path takes a fanout hint and, when a Tx ring is under flow control, returns a cookie that identifies the blocked ring. Details of the MAC client API can be found in the excellent Crossbow Network Virtualization document.

mac_tx() is the API that MAC clients need to use to send packets out on the wire.

mac_tx_cookie_t mac_tx(mac_client_handle_t mch, mblk_t *mp_chain, uintptr_t hint, uint16_t flag, mblk_t **ret_mp);

mac_tx() takes a hint which is used to fan out the outgoing traffic among the Tx rings. On successful transmission, NULL is returned to the caller. If the transmission is unsuccessful, a cookie (mac_tx_cookie_t) is returned that uniquely identifies the blocked Tx ring.
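To make the hint and cookie contract concrete, here is a small user-space sketch. It is not the MAC layer code: every sim_ name, the ring count, the descriptor count and the shift-and-modulo hash are assumptions made up for illustration. It only shows the shape of the idea: the hint picks a ring, a successful send returns 0, and a blocked send returns a cookie naming the ring that is out of descriptors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_NRINGS      4   /* hypothetical number of Tx rings */
#define SIM_RING_DESCS  2   /* hypothetical descriptors per ring */

/* Toy stand-in for mac_tx_cookie_t: 0 means "sent", nonzero names a ring. */
typedef uintptr_t sim_tx_cookie_t;

typedef struct sim_tx_ring {
	int free_descs;             /* descriptors still available */
} sim_tx_ring_t;

typedef struct sim_mac_client {
	sim_tx_ring_t rings[SIM_NRINGS];
} sim_mac_client_t;

static void
sim_client_init(sim_mac_client_t *mc)
{
	for (int i = 0; i < SIM_NRINGS; i++)
		mc->rings[i].free_descs = SIM_RING_DESCS;
}

/*
 * Map a fanout hint (for example the address of a connection structure)
 * to a ring index. The shift drops low bits that are usually zero because
 * of alignment. This hash is an assumption, not the real one.
 */
static unsigned
sim_ring_for_hint(uintptr_t hint)
{
	return ((unsigned)((hint >> 4) % SIM_NRINGS));
}

/*
 * Try to send one packet. Returns 0 on success. If the chosen ring has no
 * descriptors left, returns a cookie that identifies that ring only.
 */
static sim_tx_cookie_t
sim_tx(sim_mac_client_t *mc, uintptr_t hint)
{
	sim_tx_ring_t *ring = &mc->rings[sim_ring_for_hint(hint)];

	assert(mc != NULL);
	if (ring->free_descs == 0)
		return ((sim_tx_cookie_t)ring);
	ring->free_descs--;
	return (0);
}
```

In this sketch, a sender whose hint maps to a full ring gets a nonzero cookie back, while a sender whose hint maps to another ring keeps succeeding.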

Clients like TCP/UDP/IP using the MAC API can take advantage of this to do efficient flow control.

UDP flow control

UDP uses a mix of STREAMS and direct function calls to send traffic out of the box. The STREAMS canput(9F) function is called before calling the direct function in DLD to send the packet down. Note that the direct function pointers between IP and DLD are exchanged as part of the IP plumbing operations performed by the ifconfig command. The direct function pointer on the Tx side takes the fanout hint. UDP passes the address of the conn_t structure as the fanout hint for UDP connections originating from the system, and this hint is used to select a Tx ring. If a Tx ring gets flow blocked, a cookie is returned to the caller to indicate the flow blocked condition. The cookie identifies the specific Tx ring that is blocked (this cookie is not used in the phase1 putback to do flow control). At this time, DLD sets the QFULL condition on the write queue. This causes all future canput(9F) calls on DLD to fail, including those for connections that were sending packets out of Tx rings that are not blocked. This leads to bad Tx performance for UDP under stress. The effect is more severe for small packets, as small packets cause a Tx ring to run out of descriptors very quickly.

UDP flow control with Crossbow

The cookie that gets returned to IP is used to do flow control on a per Tx ring basis. If a UDP conn gets blocked, the returned cookie is used to hash into a blocked conn list, and the blocked conn is placed on that list. All other UDP conns that are sending on the same Tx ring and get blocked are placed on the same list. Later, when flow control is relieved, the GLDv3 (MAC) layer makes a direct call into IP, passing the cookie. The cookie is used to find the blocked conn list and restart the blocked UDP conns. UDP conns that were sending on other Tx rings that were not blocked are not affected and continue sending packets.
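The blocked conn list described above can be sketched as a small hash table keyed by the cookie. This is only an illustration of the idea, not the IP code: the sim_ names, the bucket count, the hash and the struct layout are all hypothetical. A conn is parked under the cookie it was handed, and when the ring behind that cookie drains, only the conns parked under that cookie are restarted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_BLOCKED_BUCKETS 8   /* hypothetical hash table size */

/* Toy stand-in for a UDP conn; fields are assumptions for the sketch. */
typedef struct sim_conn {
	int id;
	int blocked;                /* 1 while parked on a blocked list */
	uintptr_t cookie;           /* cookie of the ring that blocked us */
	struct sim_conn *next;
} sim_conn_t;

typedef struct sim_blocked_table {
	sim_conn_t *bucket[SIM_BLOCKED_BUCKETS];
} sim_blocked_table_t;

static unsigned
sim_cookie_hash(uintptr_t cookie)
{
	return ((unsigned)((cookie >> 4) % SIM_BLOCKED_BUCKETS));
}

/* Park a conn on the list selected by the cookie it was handed. */
static void
sim_conn_block(sim_blocked_table_t *t, sim_conn_t *c, uintptr_t cookie)
{
	unsigned b = sim_cookie_hash(cookie);

	assert(cookie != 0);
	c->blocked = 1;
	c->cookie = cookie;
	c->next = t->bucket[b];
	t->bucket[b] = c;
}

/*
 * Called when the ring identified by the cookie has room again.
 * Restarts every conn parked under that cookie and leaves conns that
 * hashed to the same bucket under a different cookie untouched.
 * Returns the number of conns restarted.
 */
static int
sim_flow_enable(sim_blocked_table_t *t, uintptr_t cookie)
{
	sim_conn_t **pp = &t->bucket[sim_cookie_hash(cookie)];
	int restarted = 0;

	while (*pp != NULL) {
		sim_conn_t *c = *pp;

		if (c->cookie == cookie) {
			*pp = c->next;      /* unlink from the blocked list */
			c->next = NULL;
			c->blocked = 0;
			restarted++;
		} else {
			pp = &c->next;
		}
	}
	return (restarted);
}
```

Comparing the stored cookie during the walk matters in this sketch because two different rings can hash to the same bucket.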

Performance improvement

With the new per Tx ring UDP flow control in place, all NICs that have multiple Tx rings benefit. On an Oplin card, we saw an improvement in the range of 20-25% for packets of sizes 16, 64, 128, 256 and 512.

Friday May 05, 2006

Soft Rings (pre-Crossbow or Crossbow Phase0)

Soft rings is a feature that I worked on recently and putback the changes into S10 update 2. This feature improves incoming network traffic performance. This is the worker thread model of processing packets. The incoming traffic is made to land on a soft ring and a worker thread will pick up the packet and deliver it to IP.

Let's for a minute go back and see what problem we are trying to solve:

The FireEngine architecture introduced a per-CPU synchronization mechanism called vertical perimeter inside TCP/IP module. These vertical perimeters are implemented using a serialization queue abstraction called squeue. A connection is bound to an instance of squeue when the connection is initialized. Afterwards all packets for the connection are always processed on the same squeue. In the case of new incoming connections, they get bound to the squeue of the CPU that took the interrupt. This helps achieve better cache locality and increased network performance.

Now on systems with slow CPUs (CPU speed less than 1 GHz), a single CPU cannot handle an incoming load of 1 Gbps. Likewise, even faster CPUs cannot handle the loads generated by 10 Gbps NICs. The solution is to fan out the load so that it is handled by multiple CPUs.

The current solution of enabling this by setting ip_squeue_fanout to 1 is suboptimal (or rather, one can say it is broken). With ip_squeue_fanout set to 1, a random squeue that could belong to any one of the CPUs in the system gets selected for each new incoming TCP connection, and the packet could then get processed in the same context, that is, on the CPU that took the interrupt. This is bad because what you want is for the other CPU to process the packets belonging to its own squeue.

The problem is addressed by soft rings. A soft ring is an abstraction that simulates hardware Rx ring functionality in software. Multiple soft rings can be configured on a system (tunable: ip_soft_rings_cnt). By default, 2 soft rings are configured. Incoming traffic is made to land on one of the soft rings. The soft ring has a pointer to the right squeue to which the packet has to be delivered. A worker thread is created for each soft ring; it picks up packets from the soft ring and delivers them to IP. The worker thread has affinity to the CPU to which the squeue belongs. All this helps in efficient processing of the packets.
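The split between the interrupt side and the worker side can be sketched in user space as follows. This is a simplified model, not the Solaris implementation: there are no real threads or squeues here, the sim_ names and queue depth are hypothetical, and choosing the ring by hashing the connection addresses and ports is an assumption, since the text above does not say how a ring is picked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_SOFT_RINGS  2    /* mirrors the default ip_soft_rings_cnt of 2 */
#define SIM_RING_DEPTH  16   /* hypothetical queue depth (power of two) */

typedef struct sim_pkt {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
} sim_pkt_t;

typedef struct sim_soft_ring {
	sim_pkt_t q[SIM_RING_DEPTH];
	unsigned head;              /* next packet to dequeue */
	unsigned tail;              /* next free slot */
	int cpu;                    /* CPU whose squeue this ring feeds */
	unsigned delivered;         /* packets handed to "IP" so far */
} sim_soft_ring_t;

static void
sim_rings_init(sim_soft_ring_t *rings)
{
	for (int i = 0; i < SIM_SOFT_RINGS; i++) {
		rings[i].head = rings[i].tail = 0;
		rings[i].delivered = 0;
		rings[i].cpu = i;       /* one worker CPU per ring in this model */
	}
}

/* Assumed policy: hash the flow so one connection always uses one ring. */
static unsigned
sim_pick_ring(const sim_pkt_t *p)
{
	uint32_t h = p->src_ip ^ p->dst_ip ^
	    (((uint32_t)p->src_port << 16) | p->dst_port);

	h ^= h >> 16;
	return (h % SIM_SOFT_RINGS);
}

/* Interrupt side: only enqueue. Returns the ring index, or -1 if full. */
static int
sim_rx(sim_soft_ring_t *rings, const sim_pkt_t *p)
{
	unsigned i = sim_pick_ring(p);
	sim_soft_ring_t *r = &rings[i];

	if (r->tail - r->head == SIM_RING_DEPTH)
		return (-1);
	r->q[r->tail % SIM_RING_DEPTH] = *p;
	r->tail++;
	return ((int)i);
}

/* Worker side: drain the ring. Returns the number of packets delivered. */
static unsigned
sim_worker_drain(sim_soft_ring_t *r)
{
	unsigned n = 0;

	while (r->head != r->tail) {
		/* A real worker would pass this packet to IP on r->cpu. */
		r->head++;
		r->delivered++;
		n++;
	}
	return (n);
}
```

The point of the model is that sim_rx() does almost no work in the interrupt path, and the per-ring worker, which would run on the CPU that owns the target squeue, does the actual delivery.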

Other considerations:

Fanout based on the hardware/platform:

Consider Niagara processors. A Niagara processor contains multiple cores in a single chip, and each core in turn can run 4 threads. When handling software fanout, due consideration is given to having the incoming data handled by threads (these threads are counted as CPUs) in the same core that took the interrupt. This helps preserve interrupt to CPU/core affinity.

The same is true of AMD dual core processors. It would be optimal if the load can be fanned out to CPUs on the same core to capitalize on the shared L2 cache.

How to enable soft rings?

You need to have Solaris 10 update 2.

On Niagara platforms (T1000 and T2000s), it is enabled by default.

On other platforms, it can be enabled by setting ip_squeue_fanout to 1.

ip_soft_rings_cnt has a default value of 2. A value of 2 or 3 has been found to be optimal for getting good performance on 1Gbps NICs on the Niagara platforms.

About myself

My name is Rajagopal Kunhappan. People call me Raj or Gopi, and some even address me by my login name, krgopi. Any of these is fine, but my preference is Gopi, as that is the oldest short name for me.

I work in the network performance team -- the charter of this team is to keep continuously innovating to improve network performance and to keep up with new fast hardware like 10Gbps NICs. This team has delivered the FireEngine architecture (TCP/IP performance), Yosemite (UDP performance), Nemo aka GLDv3 (high performance network framework), etc.

Presently I am working as part of the CrossBow team. The soft rings feature, on which I will blog soon, is the first CrossBow deliverable (which we called Phase 0). I hope to blog actively, and on things other than technical topics too. Anyway, "Hello world!"



