Wednesday Nov 16, 2011

Exploring TCP throughput with DTrace (2)

Last time, I described how we can use the overlap in distributions of unacknowledged byte counts and send window to determine whether the peer's receive window may be too small, limiting throughput. Let's combine that comparison with a comparison of congestion window and slow start threshold, all on a per-port/per-client basis. This will help us

  • Identify whether the congestion window or the receive window are limiting factors on throughput by comparing the distributions of congestion window and send window values to the distribution of outstanding (unacked) bytes. This will allow us to get a visual sense for how often we are thwarted in our attempts to fill the pipe due to congestion control versus the peer not being able to receive any more data.
  • Identify whether slow start or congestion avoidance predominate by comparing the overlap in the congestion window and slow start distributions. If the slow start threshold distribution overlaps with the congestion window, we know that we have switched between slow start and congestion avoidance, possibly multiple times.
  • Identify whether the peer's receive window is too small by comparing the distribution of outstanding unacked bytes with the send window distribution (i.e. the peer's receive window). I discussed this here.

# dtrace -s tcp_window.d
dtrace: script 'tcp_window.d' matched 10 probes
^C

  cwnd                                                  80  10.175.96.92                                      
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |                                         4        
            4096 |                                         6        
            8192 |                                         18       
           16384 |                                         36       
           32768 |@                                        79       
           65536 |@                                        155      
          131072 |@                                        199      
          262144 |@@@                                      400      
          524288 |@@@@@@                                   798      
         1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             3848     
         2097152 |                                         0        

  ssthresh                                              80  10.175.96.92                                      
           value  ------------- Distribution ------------- count    
       268435456 |                                         0        
       536870912 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5543     
      1073741824 |                                         0        

  unacked                                               80  10.175.96.92                                      
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         1        
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         3        
             512 |                                         0         
            1024 |                                         0        
            2048 |                                         4        
            4096 |                                         9        
            8192 |                                         21       
           16384 |                                         36       
           32768 |@                                        78       
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  5391     
          131072 |                                         0        

  swnd                                                  80  10.175.96.92                                      
           value  ------------- Distribution ------------- count    
           32768 |                                         0        
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5543     
          131072 |                                         0        

Here we are observing a large file transfer via http on the webserver. Comparing these distributions, we can observe:

  • That the connection is in slow start. The distribution of congestion window values lies entirely below the range of slow start threshold values (536870912 and up), so congestion avoidance has not yet kicked in.
  • Both the unacked byte count and the send window values peak in the 65536-131071 range, but the send window value distribution is narrower. This tells us that the peer TCP's receive window is not closing.
  • The congestion window distribution peaks in the 1048576-2097152 range, while the send window distribution (the peer's receive window) is confined to the 65536-131071 range. Since the cwnd distribution ranges as low as 2048-4095, congestion control was a limiting factor on transfer for some of the time we have been observing the connection, but for the majority of the time the peer's receive window would more likely have been the limit. However, we know the window never closed, as the distribution of swnd values stays within the 65536-131071 range.

So all in all, we have a connection that has been mildly constrained by congestion control, but for the bulk of the time we have been observing it, neither congestion control nor the peer's receive window has limited throughput. Here's the script:

#!/usr/sbin/dtrace -s


tcp:::send
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        @cwnd["cwnd", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize(args[3]->tcps_cwnd);
        @ssthresh["ssthresh", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize(args[3]->tcps_cwnd_ssthresh);
        @unacked["unacked", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize(args[3]->tcps_snxt - args[3]->tcps_suna);
        @swnd["swnd", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize((args[4]->tcp_window)*(1 << args[3]->tcps_snd_ws));
}

One surprise here is that slow start is still in operation - one would expect that for a large file transfer, acknowledgements would push the congestion window up past the slow start threshold over time. The slow start threshold is in fact still close to its initial (very high) value, which suggests we have not experienced any congestion (the slow start threshold is adjusted downwards when congestion occurs). Also, the above measurements were taken early in the connection's lifetime, so the congestion window did not get a chance to be bumped up to the level of the slow start threshold.
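To make the slow start versus congestion avoidance distinction concrete, here is a hypothetical sketch (not the actual OS implementation, which differs in many details) of how a sender might grow its congestion window on each ACK:

```python
def grow_cwnd(cwnd, ssthresh, mss):
    """Hypothetical per-ACK congestion window growth: exponential below
    ssthresh (slow start, +1 MSS per ACK so cwnd roughly doubles per RTT),
    linear above it (congestion avoidance, roughly +1 MSS per RTT)."""
    if cwnd < ssthresh:
        return cwnd + mss              # slow start
    return cwnd + mss * mss // cwnd    # congestion avoidance

# With a very high initial ssthresh (as in the trace above), every ACK
# lands in the slow start branch, so cwnd keeps growing exponentially.
cwnd, ssthresh, mss = 1460, 536870912, 1460
for _ in range(10):
    cwnd = grow_cwnd(cwnd, ssthresh, mss)
print(cwnd)  # 16060 - still far below ssthresh, so slow start persists
```

This is why the cwnd distribution above can keep climbing: with ssthresh untouched at its huge initial value, nothing has yet forced the switch to linear growth.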

A good strategy when examining these sorts of measurements for a given service (such as a webserver) would be to start by examining the distributions above aggregated by port number only, to get an overall feel for service performance, i.e. is congestion control or peer receive window size an issue, or are we unconstrained in filling the pipe? From there, the overlap of distributions will tell us whether to drill down into specific clients. For example, if the send window distribution has multiple peaks, we may want to examine whether particular clients show issues with their receive window.

Exploring TCP throughput with DTrace

One key measure to use when assessing TCP throughput is the amount of unacknowledged data in the pipe. In DTrace terms, the amount of unacknowledged data in bytes for the connection is the difference between the next sequence number to send and the lowest unacknowledged sequence number (tcps_snxt - tcps_suna). According to the theory, when the number of unacknowledged bytes for the connection is less than the receive window of the peer, the path bandwidth is the limiting factor for throughput. In other words, if we can fill the pipe without the peer TCP complaining (by virtue of its window size reaching 0), we are purely bandwidth-limited. If the peer's receive window is too small however, the sending TCP has to wait for acknowledgements before it can send more data. In this case the round-trip time (RTT) limits throughput. In such cases the effective throughput limit is the window size divided by the RTT, e.g. if the window size is 64K and the RTT is 0.5sec, the throughput is 128K/s.
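The window-limited bound can be sketched as a quick calculation (the numbers below are the illustrative ones from the text, not measurements):

```python
def window_limited_throughput(window_bytes, rtt_seconds):
    """Upper bound on throughput when the sender must wait a full RTT
    for acknowledgements before sending another window's worth of data."""
    return window_bytes / rtt_seconds

# The example from the text: a 64K window and a 0.5s RTT
# cap throughput at 128K per second.
print(window_limited_throughput(64 * 1024, 0.5))  # 131072.0 bytes/s = 128K/s
```

Note this is a ceiling, not a prediction: actual throughput will be the lesser of this bound and what the path bandwidth allows.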

So a neat way to visually determine if the receive window of clients may be too small is to compare the distribution of unacknowledged byte values for the server versus the client's advertised receive window. If the unacked distribution overlaps the send window distribution such that it extends to the right (or lower down in DTrace, since quantizations are displayed vertically), it indicates that the amount of unacknowledged data regularly exceeds the client's receive window, so the sender may at times have more data to send but be blocked by a zero window on the client side.

In the following example, we compare the distribution of unacked values to the receive window advertised by the receiver (10.175.96.92) for a large file download via http.

# dtrace -s tcp_tput.d
^C

  unacked(bytes)                                    10.175.96.92                                          80
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         6        
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         3        
             512 |                                         0        
            1024 |                                         0        
            2048 |                                         9        
            4096 |                                         14       
            8192 |                                         27       
           16384 |                                         67       
           32768 |@@                                       1464     
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   32396    
          131072 |                                         0        

  SWND(bytes)                                         10.175.96.92                                          80
           value  ------------- Distribution ------------- count    
           16384 |                                         0        
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 17067    
           65536 |                                         0        

Here we have a puzzle. We can see that the receiver's advertised window is in the 32768-65535 range, while the amount of unacknowledged data in the pipe is largely in the 65536-131071 range. What's going on here? Surely in a case like this we should see zero-window events, since the amount of data in the pipe regularly exceeds the window size of the receiver. Yet no zero-window events appear: the SWND distribution shows no 0 values - it stays within the 32768-65535 range.

The explanation is straightforward enough. TCP window scaling is in operation for this connection - the Window Scale TCP option is used on connection setup to allow a connection to advertise (and have advertised to it) a window greater than 65535 bytes. In this case the scaling shift is 1, and this explains why the raw SWND values are clustered in the 32768-65535 range rather than the 65536-131071 range - the raw value needs to be multiplied by two, since the receiver is scaling its window by a shift factor of 1.
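The scaling arithmetic the fixed script performs can be sketched as follows (the shift value of 1 is taken from the connection above; in a real trace it comes from the Window Scale option negotiated at connection setup):

```python
def scaled_window(raw_window, window_scale_shift):
    """Recover the true advertised receive window from the raw 16-bit
    window field and the negotiated window-scale shift (RFC 1323)."""
    return raw_window << window_scale_shift

# With a shift of 1, a raw value in the 32768-65535 range corresponds
# to a true window of 65536-131070 bytes - which is exactly where the
# unacked distribution sits, resolving the apparent puzzle.
print(scaled_window(32768, 1))  # 65536
```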

Here's the simple script that compares unacked and SWND distributions, fixed to take account of window scaling.

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        @unacked["unacked(bytes)", args[2]->ip_daddr, args[4]->tcp_sport] =
            quantize(args[3]->tcps_snxt - args[3]->tcps_suna);
}

tcp:::receive
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        @swnd["SWND(bytes)", args[2]->ip_saddr, args[4]->tcp_dport] =
            quantize((args[4]->tcp_window)*(1 << args[3]->tcps_snd_ws));

}

And here's the fixed output.

# dtrace -s tcp_tput_scaled.d
^C
  unacked(bytes)                                     10.175.96.92                                          80
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         39       
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         3        
             512 |                                         0        
            1024 |                                         0        
            2048 |                                         4        
            4096 |                                         9        
            8192 |                                         22       
           16384 |                                         37       
           32768 |@                                        99       
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   3858     
          131072 |                                         0        

  SWND(bytes)                                         10.175.96.92                                          80
           value  ------------- Distribution ------------- count    
             512 |                                         0        
            1024 |                                         1        
            2048 |                                         0        
            4096 |                                         2        
            8192 |                                         4        
           16384 |                                         7        
           32768 |                                         14       
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1956     
          131072 |                                         0        
About

user12820842
