Wednesday Nov 16, 2011

Exploring TCP throughput with DTrace (2)

Last time, I described how we can use the overlap in distributions of unacknowledged byte counts and send window to determine whether the peer's receive window may be too small, limiting throughput. Let's combine that comparison with a comparison of congestion window and slow start threshold, all on a per-port/per-client basis. This will help us

  • Identify whether the congestion window or the receive window are limiting factors on throughput by comparing the distributions of congestion window and send window values to the distribution of outstanding (unacked) bytes. This will allow us to get a visual sense for how often we are thwarted in our attempts to fill the pipe due to congestion control versus the peer not being able to receive any more data.
  • Identify whether slow start or congestion avoidance predominate by comparing the overlap in the congestion window and slow start distributions. If the slow start threshold distribution overlaps with the congestion window, we know that we have switched between slow start and congestion avoidance, possibly multiple times.
  • Identify whether the peer's receive window is too small by comparing the distribution of outstanding unacked bytes with the send window distribution (i.e. the peer's receive window). I discussed this here.

# dtrace -s tcp_window.d
dtrace: script 'tcp_window.d' matched 10 probes
^C

  cwnd                                                  80  10.175.96.92                                      
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |                                         4        
            4096 |                                         6        
            8192 |                                         18       
           16384 |                                         36       
           32768 |@                                        79       
           65536 |@                                        155      
          131072 |@                                        199      
          262144 |@@@                                      400      
          524288 |@@@@@@                                   798      
         1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             3848     
         2097152 |                                         0        

  ssthresh                                              80  10.175.96.92                                      
           value  ------------- Distribution ------------- count    
       268435456 |                                         0        
       536870912 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5543     
      1073741824 |                                         0        

  unacked                                               80  10.175.96.92                                      
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         1        
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         3        
             512 |                                         0         
            1024 |                                         0        
            2048 |                                         4        
            4096 |                                         9        
            8192 |                                         21       
           16384 |                                         36       
           32768 |@                                        78       
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  5391     
          131072 |                                         0        

  swnd                                                  80  10.175.96.92                                      
           value  ------------- Distribution ------------- count    
           32768 |                                         0        
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5543     
          131072 |                                         0        

Here we are observing a large file transfer via http on the webserver. Comparing these distributions, we can observe:

  • That slow start congestion control is in operation. The distribution of congestion window values lies below the range of slow start threshold values (which are in the 536870912+ range), so the connection is in slow start mode.
  • Both the unacked byte count and the send window values peak in the 65536-131071 range, but the send window value distribution is narrower. This tells us that the peer TCP's receive window is not closing.
  • The congestion window distribution peaks in the 1048576 - 2097152 range while the receive window distribution is confined to the 65536-131071 range. Since the cwnd distribution ranges as low as 2048-4095, we can see that for some of the time we have been observing the connection, congestion control has been a limiting factor on transfer, but for the majority of the time the receive window of the peer would more likely have been the limiting factor. However, we know the window has never closed as the distribution of swnd values stays within the 65536-131071 range.
So all in all we have a connection that has been mildly constrained by congestion control, but for the bulk of the time we have been observing it neither congestion or peer receive window have limited throughput. Here's the script:

#!/usr/sbin/dtrace -s


tcp:::send
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        @cwnd["cwnd", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize(args[3]->tcps_cwnd);
        @ssthresh["ssthresh", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize(args[3]->tcps_cwnd_ssthresh);
        @unacked["unacked", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize(args[3]->tcps_snxt - args[3]->tcps_suna);
        @swnd["swnd", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize((args[4]->tcp_window)*(1 << args[3]->tcps_snd_ws));
}

One surprise here is that slow start is still in operation - one would assume that for a large file transfer, acknowledgements would push the congestion window up past the slow start threshold over time. The slow start threshold is in fact still close to it's initial (very high) value, so that would suggest we have not experienced any congestion (the slow start threshold is adjusted when congestion occurs). Also, the above measurements were taken early in the connection lifetime, so the congestion window did not get a changes to get bumped up to the level of the slow start threshold.

A good strategy when examining these sorts of measurements for a given service (such as a webserver) would be start by examining the distributions above aggregated by port number only to get an overall feel for service performance, i.e. is congestion control or peer receive window size an issue, or are we unconstrained to fill the pipe? From there, the overlap of distributions will tell us whether to drill down into specific clients. For example if the send window distribution has multiple peaks, we may want to examine if particular clients show issues with their receive window.

Monday Nov 14, 2011

Observing flow control idle time in TCP

Previously I described how to observe congestion control strategies during transmission, and here I talked about TCP's sliding window approach for handling flow control on the receive side. A neat trick would now be to put the pieces together and ask the following question - how often is TCP transmission blocked by congestion control (send-side flow control) versus a zero-sized send window (which is the receiver saying it cannot process any more data)? So in effect we are asking whether the size of the receive window of the peer or the congestion control strategy may be sub-optimal. The result of such a problem would be that we have TCP data that we could be transmitting but we are not, potentially effecting throughput.

So flow control is in effect:

  • when the congestion window is less than or equal to the amount of bytes outstanding on the connection. We can derive this from args[3]->tcps_snxt - args[3]->tcps_suna, i.e. the difference between the next sequence number to send and the lowest unacknowledged sequence number; and
  • when the window in the TCP segment received is advertised as 0
We time from these events until we send new data (i.e. args[4]->tcp_seq >= snxt value when window closes. Here's the script:

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::send
/ (args[3]->tcps_snxt - args[3]->tcps_suna) >= args[3]->tcps_cwnd /
{
        cwndclosed[args[1]->cs_cid] = timestamp;
        cwndsnxt[args[1]->cs_cid] = args[3]->tcps_snxt;
        @numclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] = count();
}

tcp:::send
/ cwndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= cwndsnxt[args[1]->cs_cid] /
{
        @meantimeclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] =
            avg(timestamp - cwndclosed[args[1]->cs_cid]);
        @stddevtimeclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] =
            stddev(timestamp - cwndclosed[args[1]->cs_cid]);
        @numclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] = count();
        cwndclosed[args[1]->cs_cid] = 0;
        cwndsnxt[args[1]->cs_cid] = 0;
}

tcp:::receive
/ args[4]->tcp_window == 0 &&
  (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        swndclosed[args[1]->cs_cid] = timestamp;
        swndsnxt[args[1]->cs_cid] = args[3]->tcps_snxt;
        @numclosed["swnd", args[2]->ip_saddr, args[4]->tcp_dport] = count();
}

tcp:::send
/ swndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= swndsnxt[args[1]->cs_cid] /
{
        @meantimeclosed["swnd", args[2]->ip_daddr, args[4]->tcp_sport] =
            avg(timestamp - swndclosed[args[1]->cs_cid]);
        @stddevtimeclosed["swnd", args[2]->ip_daddr, args[4]->tcp_sport] =
            stddev(timestamp - swndclosed[args[1]->cs_cid]);
        swndclosed[args[1]->cs_cid] = 0;
        swndsnxt[args[1]->cs_cid] = 0;
}


END
{
        printf("%-6s %-20s %-8s %-25s %-8s %-8s\n", "Window",
            "Remote host", "Port", "TCP Avg WndClosed(ns)", "StdDev",
            "Num");
        printa("%-6s %-20s %-8d %@-25d %@-8d %@-8d\n", @meantimeclosed,
            @stddevtimeclosed, @numclosed);
}

So this script will show us whether the peer's receive window size is preventing flow ("swnd" events) or whether congestion control is limiting flow ("cwnd" events). As an example I traced on a server with a large file transfer in progress via a webserver and with an active ssh connection running "find / -depth -print". Here is the output:

^C
Window Remote host          Port     TCP Avg WndClosed(ns)     StdDev   Num     
cwnd   10.175.96.92         80       86064329                  77311705 125     
cwnd   10.175.96.92         22       122068522                 151039669 81  

So we see in this case, the congestion window closes 125 times for port 80 connections and 81 times for ssh. The average time the window is closed is 0.086sec for port 80 and 0.12sec for port 22.

So if you wish to change congestion control algorithm in Oracle Solaris 11, a useful step may be to see if congestion really is an issue on your network. Scripts like the one posted above can help assess this, but it's worth reiterating that if congestion control is occuring, that's not necessarily a problem that needs fixing. Recall that congestion control is about controlling flow to prevent large-scale drops, so looking at congestion events in isolation doesn't tell us the whole story. For example, are we seeing more congestion events with one control algorithm, but more drops/retransmission with another? As always, it's best to start with measures of throughput and latency before arriving at a specific hypothesis such as "my congestion control algorithm is sub-optimal".

Wednesday Jun 23, 2010

DTrace TCP provider and TCP's sliding window

In the last post, I mentioned I'd demonstrate how we can probe TCP sliding window behaviour. When examining performance problems relating to sliding window on a system, the key thing to determine is if we are encountering situations where either:

  • the send window of any TCP connections is zero. This indicates we can no longer send data since the peer connection is advertising a zero-sized window.
  • the receive window of any TCP connections is zero. This indicates we can no longer receive data since we are advertising a zero-sized window.

If either of these situations occurs regularly, we will then need to dig down and find out why the send or receive windows are filling up, but the first step is to determine if these events are occuring, and if so, how long are we stuck unable to send or receive new data? We can do this by timing the interval between zero window advertisement send and new data receive (for receive window) and receipt of a zero window advertisement and new data send (for send window). Here is a script which averages zero receive window idletime by remote address/port. This is the average time between when we advertise a zero window and when we resume receiving new data.

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::send
/ args[4]->tcp_window == 0 && (args[4]->tcp_flags & TH_RST) == 0 /
{
        rwndclosed[args[1]->cs_cid] = timestamp;
	rwndrnxt[args[1]->cs_cid] = args[3]->tcps_rnxt;
        @numrwndclosed[args[2]->ip_daddr, args[4]->tcp_dport] = count();
}

tcp:::receive
/ rwndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= rwndrnxt[args[1]->cs_cid] /
{
        @meantimeclosed[args[2]->ip_saddr, args[4]->tcp_sport] =
            avg(timestamp - rwndclosed[args[1]->cs_cid]);
	@stddevtimeclosed[args[2]->ip_saddr, args[4]->tcp_sport] =
            stddev(timestamp - rwndclosed[args[1]->cs_cid]);
	rwndclosed[args[1]->cs_cid] = 0;
        rwndrnxt[args[1]->cs_cid] = 0;
}

END
{
        printf("%-20s %-8s %-25s %-8s %-8s\\n",
            "Remote host", "Port", "TCP Avg RwndClosed(ns)", "StdDev",
            "Num");
        printa("%-20s %-8d %@-25d %@-8d %@-8d\\n", @meantimeclosed,
            @stddevtimeclosed, @numrwndclosed);
}

Here's the output showing one receive window close event, triggered by opening a hi-res YouTube video in a browser.

# dtrace -s tcp_rwndclosed.d
\^C
Remote host          Port     TCP Avg RwndClosed(ns)    StdDev   Num     
92.122.127.159       80       26914620                  0        1       

We can see we were stuck unable to receive new data for 0.269 seconds since our receive window was 0. Here's the send window equivalent:

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::receive
/ args[4]->tcp_window == 0 && (args[4]->tcp_flags & TH_RST) == 0 /
{
        swndclosed[args[1]->cs_cid] = timestamp;
	swndsnxt[args[1]->cs_cid] = args[3]->tcps_snxt;
	@numswndclosed[args[2]->ip_saddr, args[4]->tcp_sport] = count();
}

tcp:::send
/ swndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= swndsnxt[args[1]->cs_cid] /
{
        @meantimeclosed[args[2]->ip_daddr, args[4]->tcp_dport] =
            avg(timestamp - swndclosed[args[1]->cs_cid]);
	@stddevtimeclosed[args[2]->ip_daddr, args[4]->tcp_dport] =
            stddev(timestamp - swndclosed[args[1]->cs_cid]);
	swndclosed[args[1]->cs_cid] = 0;
	swndsnxt[args[1]->cs_cid] = 0;
}

END
{
        printf("%-20s %-8s %-25s %-8s %-8s\\n",
            "Remote host", "Port", "TCP Avg SwndClosed(ns)", "StdDev",
            "Num");
        printa("%-20s %-8d %@-25d %@-8d %@-8d\\n", @meantimeclosed,
            @stddevtimeclosed, @numswndclosed);
}

If a lot of zero-window events are occuring, a good next step would be to use the acknowledgement RTT latency measurement script in the previous entry - this will help determine if the ACK RTT is sufficiently high such that the send window fills up before any data is acknowledged.

About

user12820842

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today