Wednesday Jun 23, 2010

DTrace tcp provider and TCP segment sizes

At CEC2006, Brendan and Bryan showed how a neat DTrace feature can be used in conjunction with the DTrace tcp provider: we can represent the byte sizes of segments sent and received using a quantized aggregation, which visually displays the number of values falling in each power-of-two range (n to 2n, 2n to 4n, and so on). This is particularly useful for TCP, since at peak throughput we would expect a bimodal distribution, with one spike at 0 (control segments carrying no payload) and another in the range containing the maximum segment size (MSS) for the TCP connection.

Viewing the send and receive quantized aggregations (keyed by remote host IP address) alongside the mean MSS for connections to each host, we do indeed see this pattern on the receive side for 199.222.69.17 in the case below, though the results will of course vary with the TCP connection load. Here the load consisted of an HTTP request and response: the request will often fall below the MSS in size (we see one TCP send in the 512-1023 range), while the received segments are nearly all (32 of them) in the range containing the MSS (1024-2047). For localhost (127.0.0.1 and ::1), we see only zero-sized control segments.
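
As a quick illustration, a one-liner sketch along these lines quantizes sent payload sizes (IP payload length minus the TCP header length) per remote host; the full script later in this post does the same in both directions and adds the mean MSS:

# dtrace -n 'tcp:::send { @["sent bytes", args[2]->ip_daddr] = quantize(args[2]->ip_plength - args[4]->tcp_offset); }'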


# dtrace -s tcp_payloadsize.d
^C
MSS (bytes)          199.222.69.17            1372                
MSS (bytes)          127.0.0.1                4116                
MSS (bytes)          ::1                      4126                

  Sent segment size (bytes)                           127.0.0.1                                         
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18       
               1 |                                         0        

  Sent segment size (bytes)                           ::1                                               
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18       
               1 |                                         0        

  Sent segment size (bytes)                           199.222.69.17                                     
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   21       
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         0        
             512 |@@                                       1        
            1024 |                                         0        

  Received segment size (bytes)                       127.0.0.1                                         
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18       
               1 |                                         0        

  Received segment size (bytes)                       ::1                                               
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18       
               1 |                                         0        

  Received segment size (bytes)                       199.222.69.17                                     
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@                                       2        
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |@                                        1        
             512 |                                         0        
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    32       
            2048 |                                         0        

Here is the script that gathered these results. The MSS is averaged as it can change over time - more on that later.

#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Sent segment payload size: IP payload length minus the TCP header
 * length (tcp_offset), keyed by remote (destination) address.
 */
tcp:::send
{
	@transmit["Sent segment size (bytes)", args[2]->ip_daddr] =
	    quantize(args[2]->ip_plength - args[4]->tcp_offset);
	@mss["MSS (bytes)", args[2]->ip_daddr] =
	    avg(args[3]->tcps_mss);
}

/*
 * Received segment payload size, keyed by remote (source) address.
 */
tcp:::receive
{
	@receive["Received segment size (bytes)", args[2]->ip_saddr] =
	    quantize(args[2]->ip_plength - args[4]->tcp_offset);
	@mss["MSS (bytes)", args[2]->ip_saddr] =
	    avg(args[3]->tcps_mss);
}

END
{
	printa("%-20s %-24s %@-20d\n", @mss);
}

Monitoring TCP retransmission using the DTrace tcp provider

Thus far, I've tried to cover how we probe some of the basic features of TCP using the DTrace tcp provider, touching on connection establishment (measuring connection and first-byte latency), TCP's acknowledgement scheme (measuring acknowledgement latency) and TCP's sliding window model (measuring zero window idle time). Before moving on to more advanced topics, I'd like to look at another basic aspect of TCP's acknowledgement-driven reliability scheme - retransmission.

When TCP sends a segment, a retransmission timer is started, and if no acknowledgement is received for that segment after the retransmission timer has expired, it is assumed lost and will be retransmitted. Such retransmission events are easy to pick out using the DTrace provider - the tcps_retransmit field of args[3] (the tcpsinfo_t representing TCP state associated with the connection) will be 1 for tcp:::send events corresponding to retransmission. The following script monitors how many payload (i.e. non-header) bytes are transmitted and retransmitted for each remote host/port:

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
/ (args[2]->ip_plength - args[4]->tcp_offset) > 0 / 
{
	@transmit[args[2]->ip_daddr, args[4]->tcp_dport] =
	    sum(args[2]->ip_plength - args[4]->tcp_offset);
}

tcp:::send
/ (args[2]->ip_plength - args[4]->tcp_offset) > 0 &&
  args[3]->tcps_retransmit == 1 /
{
	@retransmit[args[2]->ip_daddr, args[4]->tcp_dport] =
	    sum(args[2]->ip_plength - args[4]->tcp_offset);
}

END
{
	printf("%-25s %-15s %-15s %-15s\n",
	    "Remote host", "Port", "BytesSent", "BytesResent");
	printa("%-25s %-15d %@-15d %@-15d\n",
	    @transmit, @retransmit);
}

And here's some sample output showing retransmission events for 212.147.135.190/80 and 87.248.210.254/80 - 1042 bytes out of 4189 and 579 bytes out of 5811 were retransmitted, respectively. When the ratio of retransmitted to transmitted bytes is high, further investigation is merited.

#  dtrace -s tcp_retransmit.d
^C
Remote host               Port            BytesSent       BytesResent    
66.102.9.149              80              451             0              
65.242.27.32              80              837             0              
207.123.61.126            80              2558            0              
159.134.196.136           80              2573            0              
216.34.207.180            80              2852            0              
212.147.135.190           80              4189            1042           
87.248.210.254            80              5811            579            
89.207.56.140             80              5922            0              
66.102.9.144              80              9174            0              
82.199.80.141             80              14626           0              
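
If you just want a quick sense of whether retransmission is happening at all, before breaking out byte counts, a one-liner sketch along these lines (using the same tcps_retransmit flag, counting retransmitted sends per remote host) is enough:

# dtrace -n 'tcp:::send / args[3]->tcps_retransmit == 1 / { @[args[2]->ip_daddr] = count(); }'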

DTrace TCP provider and TCP's sliding window

In the last post, I mentioned I'd demonstrate how we can probe TCP sliding window behaviour. When examining performance problems relating to the sliding window on a system, the key thing to determine is whether we are encountering situations where either:

  • the send window of any TCP connection is zero. This indicates we can no longer send data, since the peer is advertising a zero-sized window.
  • the receive window of any TCP connection is zero. This indicates we can no longer receive data, since we are advertising a zero-sized window. (A quick way to check for both conditions is sketched just below.)
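
Before timing these events, it is worth confirming that zero-window advertisements are occurring at all. Here is a minimal counting sketch (not one of the scripts below) that tallies zero-window segments we send and receive, excluding resets, keyed by remote address:

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Zero receive window: we advertise window == 0. */
tcp:::send
/ args[4]->tcp_window == 0 && (args[4]->tcp_flags & TH_RST) == 0 /
{
	@rwndzero["rwnd zero", args[2]->ip_daddr] = count();
}

/* Zero send window: the peer advertises window == 0. */
tcp:::receive
/ args[4]->tcp_window == 0 && (args[4]->tcp_flags & TH_RST) == 0 /
{
	@swndzero["swnd zero", args[2]->ip_saddr] = count();
}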

If either of these situations occurs regularly, we will then need to dig down and find out why the send or receive window is filling up, but the first step is to determine whether these events are occurring and, if so, how long we are stuck unable to send or receive new data. We can do this by timing the interval between sending a zero window advertisement and receiving new data (for the receive window), and between receiving a zero window advertisement and sending new data (for the send window). Here is a script which averages zero receive window idle time by remote address/port - the average time between when we advertise a zero window and when we resume receiving new data.

#!/usr/sbin/dtrace -s

#pragma D option quiet


/*
 * We advertise a zero receive window (ignoring RSTs): record when it
 * happened and the next sequence number we expect (tcps_rnxt).
 */
tcp:::send
/ args[4]->tcp_window == 0 && (args[4]->tcp_flags & TH_RST) == 0 /
{
	rwndclosed[args[1]->cs_cid] = timestamp;
	rwndrnxt[args[1]->cs_cid] = args[3]->tcps_rnxt;
	@numrwndclosed[args[2]->ip_daddr, args[4]->tcp_dport] = count();
}

/*
 * New data arrives at or beyond the recorded sequence number: the
 * receive window has reopened, so record how long it was closed.
 */
tcp:::receive
/ rwndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= rwndrnxt[args[1]->cs_cid] /
{
	@meantimeclosed[args[2]->ip_saddr, args[4]->tcp_sport] =
	    avg(timestamp - rwndclosed[args[1]->cs_cid]);
	@stddevtimeclosed[args[2]->ip_saddr, args[4]->tcp_sport] =
	    stddev(timestamp - rwndclosed[args[1]->cs_cid]);
	rwndclosed[args[1]->cs_cid] = 0;
	rwndrnxt[args[1]->cs_cid] = 0;
}

END
{
        printf("%-20s %-8s %-25s %-8s %-8s\n",
            "Remote host", "Port", "TCP Avg RwndClosed(ns)", "StdDev",
            "Num");
        printa("%-20s %-8d %@-25d %@-8d %@-8d\n", @meantimeclosed,
            @stddevtimeclosed, @numrwndclosed);
}

Here's the output showing one receive window close event, triggered by opening a hi-res YouTube video in a browser.

# dtrace -s tcp_rwndclosed.d
^C
Remote host          Port     TCP Avg RwndClosed(ns)    StdDev   Num     
92.122.127.159       80       26914620                  0        1       

We can see we were stuck unable to receive new data for roughly 0.027 seconds (26,914,620ns) because our receive window was zero. Here's the send window equivalent:

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::receive
/ args[4]->tcp_window == 0 && (args[4]->tcp_flags & TH_RST) == 0 /
{
        swndclosed[args[1]->cs_cid] = timestamp;
	swndsnxt[args[1]->cs_cid] = args[3]->tcps_snxt;
	@numswndclosed[args[2]->ip_saddr, args[4]->tcp_sport] = count();
}

tcp:::send
/ swndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= swndsnxt[args[1]->cs_cid] /
{
        @meantimeclosed[args[2]->ip_daddr, args[4]->tcp_dport] =
            avg(timestamp - swndclosed[args[1]->cs_cid]);
	@stddevtimeclosed[args[2]->ip_daddr, args[4]->tcp_dport] =
            stddev(timestamp - swndclosed[args[1]->cs_cid]);
	swndclosed[args[1]->cs_cid] = 0;
	swndsnxt[args[1]->cs_cid] = 0;
}

END
{
        printf("%-20s %-8s %-25s %-8s %-8s\n",
            "Remote host", "Port", "TCP Avg SwndClosed(ns)", "StdDev",
            "Num");
        printa("%-20s %-8d %@-25d %@-8d %@-8d\n", @meantimeclosed,
            @stddevtimeclosed, @numswndclosed);
}

If a lot of zero-window events are occurring, a good next step would be to use the acknowledgement latency measurement script from the previous entry - it will help determine whether the ACK round-trip time is high enough that the send window fills up before any data is acknowledged.
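
For reference, here is a minimal sketch of that kind of acknowledgement latency measurement (not the exact script from the previous entry): it records the time a data segment is sent, keyed by the sequence number that will acknowledge it, and computes the delta when an ACK covering exactly that sequence number arrives. It only matches exact cumulative ACKs and ignores sequence wraparound, so treat it as a starting point.

#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Record when we send a data segment, keyed by connection id and the
 * sequence number that will acknowledge it (seq + payload length).
 */
tcp:::send
/ (args[2]->ip_plength - args[4]->tcp_offset) > 0 /
{
	ackwait[args[1]->cs_cid,
	    args[4]->tcp_seq + (args[2]->ip_plength - args[4]->tcp_offset)] =
	    timestamp;
}

/*
 * An ACK arrives covering a recorded segment: aggregate the elapsed time.
 */
tcp:::receive
/ (args[4]->tcp_flags & TH_ACK) &&
  ackwait[args[1]->cs_cid, args[4]->tcp_ack] /
{
	@ackrtt["ACK latency (ns)", args[2]->ip_saddr, args[4]->tcp_sport] =
	    avg(timestamp - ackwait[args[1]->cs_cid, args[4]->tcp_ack]);
	ackwait[args[1]->cs_cid, args[4]->tcp_ack] = 0;
}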
