Wednesday Nov 16, 2011

Exploring TCP throughput with DTrace (2)

Last time, I described how we can use the overlap in distributions of unacknowledged byte counts and send window to determine whether the peer's receive window may be too small, limiting throughput. Let's combine that comparison with a comparison of congestion window and slow start threshold, all on a per-port/per-client basis. This will help us

  • Identify whether the congestion window or the receive window are limiting factors on throughput by comparing the distributions of congestion window and send window values to the distribution of outstanding (unacked) bytes. This will allow us to get a visual sense for how often we are thwarted in our attempts to fill the pipe due to congestion control versus the peer not being able to receive any more data.
  • Identify whether slow start or congestion avoidance predominate by comparing the overlap in the congestion window and slow start distributions. If the slow start threshold distribution overlaps with the congestion window, we know that we have switched between slow start and congestion avoidance, possibly multiple times.
  • Identify whether the peer's receive window is too small by comparing the distribution of outstanding unacked bytes with the send window distribution (i.e. the peer's receive window). I discussed this here.
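The distributions below come from DTrace's quantize() aggregation, which buckets values by powers of two. For readers unfamiliar with that output format, here is a rough Python sketch of the bucketing (an illustration only, not part of the D scripts; the sample values are invented):

```python
from collections import Counter

def quantize_bucket(value):
    """Return the power-of-two bucket a value falls into, as DTrace's
    quantize() does: bucket b holds values in [b, 2*b)."""
    if value <= 0:
        return value  # quantize has explicit zero and negative buckets
    bucket = 1
    while bucket * 2 <= value:
        bucket *= 2
    return bucket

# Distribution of some example unacked-byte counts
unacked = [300, 5000, 70000, 80000, 100000]
dist = Counter(quantize_bucket(v) for v in unacked)
# Three of the five samples land in the 65536-131071 bucket
```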

# dtrace -s tcp_window.d
dtrace: script 'tcp_window.d' matched 10 probes

  cwnd                                                  80                                      
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |                                         4        
            4096 |                                         6        
            8192 |                                         18       
           16384 |                                         36       
           32768 |@                                        79       
           65536 |@                                        155      
          131072 |@                                        199      
          262144 |@@@                                      400      
          524288 |@@@@@@                                   798      
         1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             3848     
         2097152 |                                         0        

  ssthresh                                              80                                      
           value  ------------- Distribution ------------- count    
       268435456 |                                         0        
       536870912 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5543     
      1073741824 |                                         0        

  unacked                                               80                                      
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         1        
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         3        
             512 |                                         0         
            1024 |                                         0        
            2048 |                                         4        
            4096 |                                         9        
            8192 |                                         21       
           16384 |                                         36       
           32768 |@                                        78       
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  5391     
          131072 |                                         0        

  swnd                                                  80                                      
           value  ------------- Distribution ------------- count    
           32768 |                                         0        
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5543     
          131072 |                                         0        

Here we are observing a large file transfer via http on the webserver. Comparing these distributions, we can observe:

  • That slow start congestion control is in operation. The distribution of congestion window values lies below the range of slow start threshold values (which are in the 536870912+ range), so the connection is in slow start mode.
  • Both the unacked byte count and the send window values peak in the 65536-131071 range, but the send window value distribution is narrower. This tells us that the peer TCP's receive window is not closing.
  • The congestion window distribution peaks in the 1048576 - 2097152 range while the receive window distribution is confined to the 65536-131071 range. Since the cwnd distribution ranges as low as 2048-4095, we can see that for some of the time we have been observing the connection, congestion control has been a limiting factor on transfer, but for the majority of the time the receive window of the peer would more likely have been the limiting factor. However, we know the window has never closed as the distribution of swnd values stays within the 65536-131071 range.

So all in all we have a connection that has been mildly constrained by congestion control, but for the bulk of the time we have been observing it, neither congestion control nor the peer's receive window has limited throughput. Here's the script:

#!/usr/sbin/dtrace -s

tcp:::send
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        @cwnd["cwnd", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize(args[3]->tcps_cwnd);
        @ssthresh["ssthresh", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize(args[3]->tcps_cwnd_ssthresh);
        @unacked["unacked", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize(args[3]->tcps_snxt - args[3]->tcps_suna);
        @swnd["swnd", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize((args[4]->tcp_window)*(1 << args[3]->tcps_snd_ws));
}
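The swnd quantization undoes TCP window scaling (RFC 1323): the 16-bit advertised window is multiplied by 2 raised to the window-scale shift. That arithmetic can be illustrated in Python (the sample values are made up):

```python
def effective_send_window(tcp_window, snd_ws):
    """Scale the 16-bit advertised window by the send window-scale
    shift, mirroring (args[4]->tcp_window) * (1 << args[3]->tcps_snd_ws)."""
    return tcp_window * (1 << snd_ws)

# A peer advertising a window of 16384 with a window-scale shift of 2
# can actually accept up to 65536 bytes; with no scaling (shift 0),
# the advertised value is used as-is.
```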

One surprise here is that slow start is still in operation - one would assume that for a large file transfer, acknowledgements would push the congestion window up past the slow start threshold over time. The slow start threshold is in fact still close to its initial (very high) value, which suggests we have not experienced any congestion (the slow start threshold is adjusted downwards when congestion occurs). Also, the above measurements were taken early in the connection lifetime, so the congestion window did not get a chance to be bumped up to the level of the slow start threshold.

A good strategy when examining these sorts of measurements for a given service (such as a webserver) would be to start by examining the distributions above aggregated by port number only, to get an overall feel for service performance, i.e. is congestion control or peer receive window size an issue, or are we unconstrained in filling the pipe? From there, the overlap of distributions will tell us whether to drill down into specific clients. For example, if the send window distribution has multiple peaks, we may want to examine whether particular clients show issues with their receive window.

Monday Nov 14, 2011

Observing flow control idle time in TCP

Previously I described how to observe congestion control strategies during transmission, and here I talked about TCP's sliding window approach for handling flow control on the receive side. A neat trick would now be to put the pieces together and ask the following question - how often is TCP transmission blocked by congestion control (send-side flow control) versus a zero-sized send window (which is the receiver saying it cannot process any more data)? So in effect we are asking whether the size of the receive window of the peer or the congestion control strategy may be sub-optimal. The result of such a problem would be that we have TCP data we could be transmitting but are not, potentially affecting throughput.

So flow control is in effect:

  • when the congestion window is less than or equal to the number of bytes outstanding on the connection. We can derive this from args[3]->tcps_snxt - args[3]->tcps_suna, i.e. the difference between the next sequence number to send and the lowest unacknowledged sequence number; and
  • when the window in the TCP segment received is advertised as 0.

We time from these events until we send new data (i.e. until args[4]->tcp_seq >= the snxt value recorded when the window closed). Here's the script:

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
/ (args[3]->tcps_snxt - args[3]->tcps_suna) >= args[3]->tcps_cwnd /
{
        cwndclosed[args[1]->cs_cid] = timestamp;
        cwndsnxt[args[1]->cs_cid] = args[3]->tcps_snxt;
}

tcp:::send
/ cwndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= cwndsnxt[args[1]->cs_cid] /
{
        @meantimeclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] =
            avg(timestamp - cwndclosed[args[1]->cs_cid]);
        @stddevtimeclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] =
            stddev(timestamp - cwndclosed[args[1]->cs_cid]);
        @numclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] = count();
        cwndclosed[args[1]->cs_cid] = 0;
        cwndsnxt[args[1]->cs_cid] = 0;
}

tcp:::receive
/ args[4]->tcp_window == 0 &&
  (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        swndclosed[args[1]->cs_cid] = timestamp;
        swndsnxt[args[1]->cs_cid] = args[3]->tcps_snxt;
}

tcp:::send
/ swndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= swndsnxt[args[1]->cs_cid] /
{
        @meantimeclosed["swnd", args[2]->ip_daddr, args[4]->tcp_sport] =
            avg(timestamp - swndclosed[args[1]->cs_cid]);
        @stddevtimeclosed["swnd", args[2]->ip_daddr, args[4]->tcp_sport] =
            stddev(timestamp - swndclosed[args[1]->cs_cid]);
        @numclosed["swnd", args[2]->ip_daddr, args[4]->tcp_sport] = count();
        swndclosed[args[1]->cs_cid] = 0;
        swndsnxt[args[1]->cs_cid] = 0;
}

END
{
        printf("%-6s %-20s %-8s %-25s %-8s %-8s\n", "Window",
            "Remote host", "Port", "TCP Avg WndClosed(ns)", "StdDev", "Num");
        printa("%-6s %-20s %-8d %@-25d %@-8d %@-8d\n", @meantimeclosed,
            @stddevtimeclosed, @numclosed);
}
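The per-connection bookkeeping the script performs - record a timestamp when the window closes, then compute the closed duration once a sequence number at or beyond the recorded snd_nxt is finally sent - can be sketched in Python. The event tuples here are invented purely for illustration; the real script gets this state from the tcp provider's probe arguments:

```python
import statistics

def closed_durations(events):
    """events: (kind, cid, timestamp, seq) tuples, where kind is
    'close' (window closed; seq is snd_nxt at closure) or 'send'
    (data sent; seq is the segment's sequence number).
    Returns the closed-window duration of each completed episode."""
    closed_at = {}   # cid -> (timestamp, snxt) while window is closed
    durations = []
    for kind, cid, ts, seq in events:
        if kind == 'close':
            closed_at[cid] = (ts, seq)
        elif kind == 'send' and cid in closed_at:
            start, snxt = closed_at[cid]
            if seq >= snxt:          # new data finally flowing again
                durations.append(ts - start)
                del closed_at[cid]
    return durations

events = [
    ('close', 1, 100, 5000),
    ('send', 1, 150, 4000),   # retransmit below snxt: still blocked
    ('send', 1, 180, 5000),   # new data sent: window was closed 80 ns
    ('close', 1, 200, 6000),
    ('send', 1, 260, 6500),   # closed for 60 ns this time
]
durs = closed_durations(events)
# statistics.mean(durs) and statistics.pstdev(durs) play the role of
# the script's avg() and stddev() aggregations
```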

So this script will show us whether the peer's receive window size is preventing flow ("swnd" events) or whether congestion control is limiting flow ("cwnd" events). As an example I traced on a server with a large file transfer in progress via a webserver and with an active ssh connection running "find / -depth -print". Here is the output:

Window Remote host          Port     TCP Avg WndClosed(ns)     StdDev   Num     
cwnd         80       86064329                  77311705 125     
cwnd         22       122068522                 151039669 81  

So we see in this case, the congestion window closes 125 times for port 80 connections and 81 times for ssh. The average time the window is closed is 0.086sec for port 80 and 0.12sec for port 22.
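Converting the reported averages from nanoseconds to seconds is simple arithmetic; as a sanity check on the figures quoted above:

```python
NS_PER_SEC = 1_000_000_000

# Average closed-window times from the script output, in nanoseconds
avg_closed_ns = {"http (80)": 86064329, "ssh (22)": 122068522}
avg_closed_sec = {k: v / NS_PER_SEC for k, v in avg_closed_ns.items()}
# roughly 0.086 s for port 80 and 0.122 s for port 22
```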

So if you wish to change the congestion control algorithm in Oracle Solaris 11, a useful step may be to see if congestion really is an issue on your network. Scripts like the one posted above can help assess this, but it's worth reiterating that if congestion control is occurring, that's not necessarily a problem that needs fixing. Recall that congestion control is about controlling flow to prevent large-scale drops, so looking at congestion events in isolation doesn't tell us the whole story. For example, are we seeing more congestion events with one control algorithm, but more drops/retransmissions with another? As always, it's best to start with measures of throughput and latency before arriving at a specific hypothesis such as "my congestion control algorithm is sub-optimal".

Thursday Nov 10, 2011

DTracing TCP congestion control

In a previous post, I showed how we can use DTrace to probe TCP receive and send window events. TCP receive and send windows are in effect both about flow-controlling how much data can be received - the receive window reflects how much data the local TCP is prepared to receive, while the send window simply reflects the size of the receive window of the peer TCP. Both then represent flow control as imposed by the receiver.

However, consider that without the sender imposing flow control of its own, and given a slow link to a peer, TCP will simply fill up the peer's receive window with sent segments. With multiple TCP implementations filling their peer TCPs' receive windows in this manner, busy intermediate routers may drop some of these segments, leading to timeout and retransmission, which may again lead to drops. This is termed congestion, and TCP has multiple congestion control strategies. We can see that in this example, we need some way of adjusting how much data we send depending on how quickly we receive acknowledgement - if we get ACKs quickly, we can safely send more segments, but if acknowledgements come slowly, we should proceed with more caution. More generally, we need to implement flow control on the send side also.

Slow Start and Congestion Avoidance

From RFC2581, let's examine the relevant variables:

"The congestion window (cwnd) is a sender-side limit on the amount of data the sender can transmit into the network before receiving an acknowledgment (ACK). Another state variable, the slow start threshold (ssthresh), is used to determine whether the slow start or congestion avoidance algorithm is used to control data transmission"

Slow start is used to probe the network's ability to handle transmission bursts both when a connection is first created and when retransmission timers fire. The latter case is important, as the fact that we have effectively lost TCP data acts as a motivator for re-probing how much data the network can handle from the sending TCP. The congestion window (cwnd) is initialized to a relatively small value, generally a low multiple of the sending maximum segment size. When slow start kicks in, we will only send that number of bytes before waiting for acknowledgement. When acknowledgements are received, the congestion window is increased in size until cwnd reaches the slow start threshold ssthresh value. For most congestion control algorithms the window increases exponentially under slow start, assuming we receive acknowledgements. We send 1 segment, receive an ACK, increase the cwnd by 1 MSS to 2*MSS, send 2 segments, receive 2 ACKs, increase the cwnd by 2*MSS to 4*MSS, send 4 segments etc. When the congestion window exceeds the slow start threshold, congestion avoidance is used instead of slow start.

During congestion avoidance, the congestion window is generally updated by one MSS for each round-trip-time as opposed to each ACK, and so cwnd growth is linear instead of exponential (we may receive multiple ACKs within a single RTT). This continues until congestion is detected. If a retransmit timer fires, congestion is assumed and the ssthresh value is reset. It is reset to a fraction of the number of bytes outstanding (unacknowledged) in the network. At the same time the congestion window is reset to a single max segment size. Thus, we initiate slow start until we start receiving acknowledgements again, at which point we can eventually flip over to congestion avoidance when cwnd > ssthresh.
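The growth pattern described above - exponential while cwnd <= ssthresh, roughly one MSS per round trip afterwards, with a collapse when a retransmit timer fires - can be sketched as a toy simulation. This is a simplified Reno-style conceptual model, not the Solaris implementation, and the MSS and initial ssthresh values are made up:

```python
MSS = 1460  # assumed maximum segment size

def step_cwnd(cwnd, ssthresh, acks_this_rtt, timeout=False):
    """Advance cwnd/ssthresh by one round trip under a simplified
    Reno-style model."""
    if timeout:
        # Congestion assumed: ssthresh drops to half the outstanding
        # data (approximated here by cwnd), cwnd restarts at 1 MSS.
        return MSS, max(cwnd // 2, 2 * MSS)
    if cwnd <= ssthresh:
        cwnd += acks_this_rtt * MSS     # slow start: +1 MSS per ACK
    else:
        cwnd += MSS                     # congestion avoidance: +1 MSS per RTT
    return cwnd, ssthresh

cwnd, ssthresh = MSS, 8 * MSS
history = []
for rtt in range(6):
    history.append(cwnd)
    # assume one ACK arrives per in-flight segment
    cwnd, ssthresh = step_cwnd(cwnd, ssthresh, acks_this_rtt=cwnd // MSS)
# history doubles each RTT until ssthresh is crossed, then grows by
# one MSS per RTT
```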

Congestion control algorithms differ most in how they handle the other indication of congestion - duplicate ACKs. A duplicate ACK is a strong indication that data has been lost, since it often comes from a receiver explicitly asking for a retransmission. In some cases, a duplicate ACK may be generated at the receiver as a result of packets arriving out-of-order, so it is sensible to wait for multiple duplicate ACKs before assuming packet loss rather than out-of-order delivery. This is termed fast retransmit (i.e. retransmit without waiting for the retransmission timer to expire). Note that on Oracle Solaris 11, the congestion control method used can be customized. See here for more details. In general, 3 or more duplicate ACKs indicate packet loss and should trigger fast retransmit. It's best not to revert to slow start in this case, as the fact that the receiver knew it was missing data suggests it has received data with a higher sequence number, so we know traffic is still flowing. Falling back to slow start would therefore be excessive, so fast recovery is used instead.
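The duplicate-ACK threshold logic can be illustrated with a small counter. This is a hypothetical sketch, purely to show why waiting for three duplicates distinguishes likely loss from simple reordering:

```python
DUP_ACK_THRESHOLD = 3  # conventional fast-retransmit trigger

def should_fast_retransmit(ack_seqs):
    """Scan a stream of cumulative ACK numbers and report whether any
    ACK value repeats often enough to trigger fast retransmit."""
    dupes = 0
    last = None
    for ack in ack_seqs:
        if ack == last:
            dupes += 1
            if dupes >= DUP_ACK_THRESHOLD:
                return True
        else:
            dupes = 0
            last = ack
    return False

# A single duplicate (likely reordering) does not trigger retransmit;
# three duplicates of the same ACK do.
```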

Observing slow start and congestion avoidance

The following script counts TCP segments sent when under slow start (cwnd <= ssthresh) versus congestion avoidance (cwnd > ssthresh).

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::connect-request
/ start[args[1]->cs_cid] == 0 /
{
        start[args[1]->cs_cid] = 1;
}

tcp:::send
/ start[args[1]->cs_cid] == 1 &&
  args[3]->tcps_cwnd <= args[3]->tcps_cwnd_ssthresh /
{
        @c["Slow start", args[2]->ip_daddr, args[4]->tcp_dport] = count();
}

tcp:::send
/ start[args[1]->cs_cid] == 1 &&
  args[3]->tcps_cwnd > args[3]->tcps_cwnd_ssthresh /
{
        @c["Congestion avoidance", args[2]->ip_daddr, args[4]->tcp_dport] =
            count();
}

END
{
        printf("%-25s %-25s %-10s %-10s\n", "ALGORITHM", "RADDR", "RPORT",
            "#SEG");
        printa("%-25s %-25s %-10d %@-10d\n", @c);
}
As we can see, the script only works on connections initiated after it starts, since it uses the start[] associative array (indexed by connection ID) to mark each new connection (start[cid] = 1). From there we simply differentiate send events where cwnd <= ssthresh (slow start) from those where cwnd > ssthresh (congestion avoidance). Here's the output taken when I accessed a YouTube video (where rport is 80) and from an FTP session where I put a large file onto a remote system.

# dtrace -s tcp_slow_start.d
ALGORITHM                 RADDR                     RPORT      #SEG
Slow start                     20         6
Slow start                        80        14
Slow start                     21        18
Congestion avoidance               20      1164

We see that in the case of the YouTube video (port 80), slow start was used exclusively; most of the segments we sent in that case were likely ACKs. Compare this - 14 segments sent under slow start - to the FTP data connection (port 20), where only 6 segments were sent before we switched to congestion avoidance for the remaining 1164 segments. So the FTP data on port 20 was predominantly sent with congestion avoidance in operation, while the FTP control session (port 21) relied exclusively on slow start.

For the default congestion control algorithm - "newreno" - on Solaris 11, slow start will increase the cwnd by 1 MSS for every acknowledgement received, and by 1 MSS for each RTT in congestion avoidance mode. Different pluggable congestion control algorithms operate slightly differently. For example "highspeed" will update the slow start cwnd by the number of bytes ACKed rather than the MSS.
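The difference in slow-start growth between the two algorithms can be sketched as follows. This is a simplification of the behaviour described above, not the actual kernel code, and the MSS value is an assumption:

```python
MSS = 1460  # assumed maximum segment size

def slow_start_increment(algorithm, bytes_acked):
    """Per-ACK cwnd increase in slow start under a simplified model:
    'newreno' grows cwnd by a full MSS per ACK regardless of how much
    data the ACK covers; 'highspeed' grows it by the bytes ACKed."""
    if algorithm == "newreno":
        return MSS
    elif algorithm == "highspeed":
        return bytes_acked
    raise ValueError(algorithm)

# An ACK covering only 100 bytes still grows cwnd by a full MSS under
# newreno, but by just 100 bytes under highspeed.
```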

And to finish, here's a neat oneliner to visually display the distribution of congestion window values for all TCP connections to a given remote port using a quantization. In this example, only port 80 is in use and we see the majority of cwnd values for that port are in the 4096-8191 range.

# dtrace -n 'tcp:::send { @q[args[4]->tcp_dport] = quantize(args[3]->tcps_cwnd); }'
dtrace: description 'tcp:::send ' matched 10 probes

       80
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@                                   5        
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         0        
             512 |                                         0        
            1024 |                                         0        
            2048 |@@@@@@@@@                                8        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@               23       
            8192 |                                         0        



