Friday Jun 25, 2010

Measuring RTT using DTrace while dealing realistically with TCP's acknowledgement model

In a previous entry, I mentioned that the TCP round-trip time latency script I demonstrated there had some shortcomings. It's time to be more specific as we dig deeper into TCP.

The most glaring problem is the over-simplistic assumption it makes about TCP's acknowledgement model. Recall that we traced tcp:::send events, noting what the next sequence number after each send event would be. For transmissions (as opposed to retransmissions), this corresponds to the value stored in tcps_snxt in the tcpsinfo_t. We assumed we would always get an acknowledgement specifying that sequence number - in other words, a 1:1 relationship between segments and ACKs, with each ACK acknowledging a corresponding segment. We used that assumption to match ACKs to the original send times, and thus determine round-trip time. This assumption about TCP behaviour is not valid, but it holds often enough for us to get some RTT measurements. However, it is clearly both unreliable and wasteful to fill aggregations with timestamps for every segment and use only a subset - if any - for RTT measurements.

In reality, when TCP acknowledges sequence number N, it is acknowledging all data up to sequence number N. One place this matters is delayed acknowledgement: on receiving a segment, the receiving TCP does not always acknowledge it immediately, often delaying on the grounds that there may be data to send in response, rather than just sending a bare ACK segment. As long as no other data is received, TCP can delay before sending the ACK, but if more data arrives in the meantime, a single ACK covering multiple segments may be sent. We need to fix our RTT script to deal with this fact. The best way is to store the first sequence number we encounter on a per-connection-id basis, along with the associated send timestamp. Then, when we receive a segment whose ACK number is greater than the stored send sequence number for that connection, we know it is ACKing at least that segment. This represents one RTT sample, and once it has been collected we are free to store another send sequence number for a subsequent send event. This deals with both the simplistic 1:1 segment:ACK assumption of the previous script and its wasteful use of aggregation space: only one aggregation value is used per connection at any one time, instead of one per sent segment. The old approach had the potential to fill up aggregations with data as TCP sent segment after segment before receiving an ACK - the sliding window behaviour we discussed earlier.
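Before looking at the D script, the bookkeeping can be modelled in a few lines of ordinary Python - a simplified sketch, not DTrace, in which connection ids, sequence numbers and timestamps are plain integers:

```python
# Simplified model of the one-sample-at-a-time RTT estimator described
# above: remember one unacknowledged send per connection, and take an
# RTT sample when a cumulative ACK covers it.
class RttSampler:
    def __init__(self):
        self.seq = {}      # connection id -> sequence number of the timed send
        self.start = {}    # connection id -> send timestamp
        self.samples = {}  # connection id -> list of RTT samples

    def on_send(self, cid, seq, now):
        # Only start timing a new segment if nothing is outstanding.
        if cid not in self.seq:
            self.seq[cid] = seq
            self.start[cid] = now

    def on_receive(self, cid, ack, now):
        # A cumulative ACK greater than the stored sequence number
        # acknowledges at least the timed segment.
        if cid in self.seq and ack > self.seq[cid]:
            self.samples.setdefault(cid, []).append(now - self.start[cid])
            del self.seq[cid]
            del self.start[cid]

s = RttSampler()
s.on_send(1, 100, now=0)      # timed segment
s.on_send(1, 200, now=10)     # ignored: a sample is already outstanding
s.on_receive(1, 300, now=50)  # one ACK covers both sends -> RTT sample of 50
```

Note that a single ACK covering both sends still yields exactly one sample, which is the behaviour the D script below implements with its unacked[] flag.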

So let's see the script:

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::send
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 && unacked[args[1]->cs_cid] == 0 /
{
	seq[args[1]->cs_cid] = args[4]->tcp_seq;
	start[args[1]->cs_cid] = timestamp;
	unacked[args[1]->cs_cid] = 1;
}


tcp:::receive
/ unacked[args[1]->cs_cid] == 1 && args[4]->tcp_ack > seq[args[1]->cs_cid] /
{
	@meanrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     avg(timestamp - start[args[1]->cs_cid]);
	@stddevrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     stddev(timestamp - start[args[1]->cs_cid]);
	@minrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     min(timestamp - start[args[1]->cs_cid]);
	@maxrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     max(timestamp - start[args[1]->cs_cid]);
	@countrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     count(); 
	unacked[args[1]->cs_cid] = 0;
}

END
{
	printf("%-20s %-6s %-10s %-10s %-10s %-10s %-8s\n",
            "Remote host", "Port", "AvgRTT(ns)", "StdDev", "Min", "Max", "#Samples");
	printa("%-20s %-6d %@-10d %@-10d %@-10d %@-10d %@-8d\n", @meanrtt, @stddevrtt,
	    @minrtt, @maxrtt, @countrtt);
}

We record average, standard deviation, minimum and maximum RTT for non-control segments on a per-connection basis. At any one time, a connection can have at most one timed segment outstanding (tracked via the unacked[] associative array), and when it is ACKed we record the RTT and reset unacked[connection-id] to 0 to allow another sample. The upshot is that we record only one RTT sample at a time per connection. Here's some output showing RTTs for an HTTP load:

# dtrace -s tcp_rtt4.d
^C
Remote host          Port   AvgRTT(ns) StdDev     Min        Max        #Samples
82.195.132.153       80     86295028   0          86295028   86295028   1       
64.236.124.228       80     234437989  178807729  57704050   500949801  4       

We can see an almost 10x difference between the minimum and maximum RTT measurements for 64.236.124.228, ranging from 57.7ms to 500.9ms.

We've seen how long it takes from TCP segment transmission to receipt of acknowledgement. Now let's probe how long TCP takes from receiving a segment to sending an acknowledgement - the acknowledgement latency. As noted above, TCP often delays acknowledgement; a related sender-side behaviour, Nagle's algorithm (which coalesces small writes and interacts badly with delayed acknowledgement on the peer), can be switched off using the TCP_NODELAY socket option. By running apache on a local system and measuring the time between segment receipt and acknowledgement, and comparing this to the same measurement for the requesting TCP connection (from the wget process), we can see that the apache side acknowledges promptly, consistent with it setting the TCP_NODELAY socket option (subsequent checking indeed confirms it does). By contrast, the long delay on the localhost request side (from a wget(1) command) suggests that wget does not use TCP_NODELAY. The following output shows results for "wget http://127.0.0.1", and we can see that the ACK times for the port 80 side of the connection are significantly faster than for the anonymous wget-requesting port (61670):

# dtrace -s tcp_ack.d           
^C
Remote host          Port   AvgACK(ns) StdDev     Min        Max        #Samples
127.0.0.1            80     16352199   21798743   230152     47169266   3       
127.0.0.1            61670  1640239644 0          1640239644 1640239644 1       

Here's the script that measures ACK latency:

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::receive
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 &&
  unacked[args[1]->cs_cid] == 0 /
{
	seq[args[1]->cs_cid] = args[4]->tcp_seq;
	start[args[1]->cs_cid] = timestamp;
	unacked[args[1]->cs_cid] = 1;
}


tcp:::send
/ unacked[args[1]->cs_cid] == 1 && args[4]->tcp_ack > seq[args[1]->cs_cid] /
{
	@meanack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             avg(timestamp - start[args[1]->cs_cid]);
	@stddevack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             stddev(timestamp - start[args[1]->cs_cid]);
	@minack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             min(timestamp - start[args[1]->cs_cid]);
	@maxack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             max(timestamp - start[args[1]->cs_cid]);
	@countack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             count();
	unacked[args[1]->cs_cid] = 0;
}

END
{
	printf("%-20s %-6s %-10s %-10s %-10s %-10s %-8s\n",
            "Remote host", "Port", "AvgACK(ns)", "StdDev", "Min", "Max",
            "#Samples");
	printa("%-20s %-6d %@-10d %@-10d %@-10d %@-10d %@-8d\n", @meanack,
            @stddevack, @minack, @maxack, @countack);
}
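As an aside, here's how an application opts out of Nagle's algorithm - a minimal Python sketch (the processes discussed above are C programs; this just shows the socket option itself):

```python
# Minimal sketch: disabling Nagle's algorithm with TCP_NODELAY.
# Note this affects the coalescing of small writes on the sender;
# delayed acknowledgement is a property of the receiving TCP.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# Read the option back to confirm it took effect.
assert sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```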

Tuesday Jun 22, 2010

Improved DTrace tcp acknowledgement RTT latency script

In the previous post, I concluded with a rough script that measures mean acknowledgement latency, averaged across remote hosts. Acknowledgement latency, or round-trip time (RTT), is the time spent waiting between when a TCP segment is sent and when an acknowledgement for that data is received. Acknowledgement round-trip time is an important measure since it limits how much data a TCP connection can send - if data is acknowledged quickly, TCP's sliding window can progress. As noted here, for the highest possible throughput it is important that the transmitter is not forced to stop sending by the sliding window protocol earlier than one round-trip delay time (RTT) - in other words, the window should not fill up before the earliest data is acknowledged, otherwise the sender cannot send any more data until that acknowledgement arrives. A number of factors may contribute to such delays - small advertised windows, slow protocol processing, etc. - but the important point is that we can now use DTrace both to measure RTT and to examine how the sliding window evolves over time. Here's a better version of the previous script, breaking down RTT by host and port.

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
{
	snxt[args[1]->cs_cid, args[3]->tcps_snxt] = timestamp;
	bytesrttttmp[args[1]->cs_cid, args[3]->tcps_snxt] =
	    args[2]->ip_plength - args[4]->tcp_offset;
}

tcp:::receive
/ snxt[args[1]->cs_cid, args[4]->tcp_ack] /
{
	@bytesrtt[args[2]->ip_saddr, args[4]->tcp_sport] =
	    sum(bytesrttttmp[args[1]->cs_cid, args[4]->tcp_ack]);
	@meanrtt[args[2]->ip_saddr, args[4]->tcp_sport] = 
	    avg(timestamp - snxt[args[1]->cs_cid, args[4]->tcp_ack]);
	@stddevrtt[args[2]->ip_saddr, args[4]->tcp_sport] =
            stddev(timestamp - snxt[args[1]->cs_cid, args[4]->tcp_ack]);
	@countrtt[args[2]->ip_saddr, args[4]->tcp_sport] = count();
	snxt[args[1]->cs_cid, args[4]->tcp_ack] = 0;
	bytesrttttmp[args[1]->cs_cid, args[4]->tcp_ack] = 0;
}

END
{
	printf("%-20s %-8s %-15s %-15s %-10s %-7s\n",
	    "Remote host", "Port", "TCP Avg RTT(ns)", "StdDev", "NumBytes",
	    "NumSegs");
	printa("%-20s %-8d %@-15d %@-15d %@-10d %@-7d\n", @meanrtt,
	    @stddevrtt, @bytesrtt, @countrtt);
}

And here's some sample output:

# dtrace -s tcp_rtt2.d
^C
Remote host          Port     TCP Avg RTT(ns) StdDev          NumBytes   NumSegs
127.0.0.1            9010     109074          79551           0          15     
127.0.0.1            59571    204945          0               0          1      
127.0.0.1            35748    205471          0               0          1      
127.0.0.1            39697    206126          0               0          1      
127.0.0.1            58889    207170          0               0          1      
127.0.0.1            57647    208967          0               0          1      
127.0.0.1            53502    215569          0               0          1      
127.0.0.1            41038    221163          0               0          1      
127.0.0.1            36777    226006          0               0          1      
::1                  9010     233492          186640          0          15     
127.0.0.1            41241    256225          0               0          1      
127.0.0.1            57388    297646          0               0          1      
127.0.0.1            55260    424590          0               0          1      
127.0.0.1            60660    445139          0               0          1      
127.0.0.1            56460    620226          0               0          1      
127.0.0.1            55449    670529          0               0          1      
127.0.0.1            56877    772734          0               0          1      
::1                  62343    1532445         0               0          1      
::1                  51756    1538747         0               0          1      
::1                  41825    1627673         0               0          1      
::1                  56580    1965449         0               0          1      
::1                  65061    2412497         0               0          1      
::1                  50031    2726626         0               0          1      
159.134.196.178      80       2788676         1773030         0          6      
::1                  43043    3465254         0               0          1      
::1                  57685    4454573         0               0          1      
::1                  55854    4490483         0               0          1      
92.123.196.20        80       19855492        0               0          1      
::1                  50972    20481704        0               0          1      
::1                  48172    53565332        0               0          1      
66.235.139.54        80       81219406        80478253        0          2      
::1                  37341    95086544        0               0          1      
::1                  32770    1512858144      0               0          1      
::1                  49710    4197402456      0               0          1      
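The sliding-window argument above can be quantified: to avoid stalling, the window must cover at least one bandwidth-delay product. A small illustrative calculation (the figures below are hypothetical, not taken from the traces above):

```python
# Bandwidth-delay product: the minimum window needed so the sender is
# never stalled waiting for acknowledgements.
def min_window_bytes(bandwidth_bps, rtt_seconds):
    """Return the bandwidth-delay product in bytes."""
    return bandwidth_bps * rtt_seconds / 8

# A 100 Mbit/s path with a 50 ms RTT needs a window of about 625 KB.
bdp = min_window_bytes(100e6, 0.050)

# Conversely, with only a 64 KB window the best case is window/RTT:
# roughly 10.5 Mbit/s on that same path, a tenth of link capacity.
max_throughput_bps = 64 * 1024 * 8 / 0.050
```

This is why a suspiciously large RTT to a peer is worth chasing down: it directly caps throughput for a given window size.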

Next time, we'll look at TCP's sliding window behaviour.

Measuring latencies with the DTrace TCP provider

As I mentioned in the previous entry, the DTrace tcp provider can be used to carry out a number of useful latency measurements. We have included some scripts that help with these measurements, and I'll try to explain them here. Another useful reference is Bryan and Brendan's CEC2006 presentation, though some minor details of the TCP provider arguments have changed since then - consult the updated docs when in doubt.

In the last entry, we used the tcp:::send and tcp:::receive probe points exclusively. There are also a number of probe points relating to TCP connection establishment:

tcp:::connect-request - initial SYN sent as part of the three-way handshake to establish an active connection
tcp:::connect-refused - initial SYN was refused and we receive a RST|ACK "reset" segment
tcp:::connect-established - we send the final ACK, completing establishment of the active connection
tcp:::accept-refused - initial SYN has been received, but nothing is listening; a RST|ACK "reset" segment is sent
tcp:::accept-established - final ACK has been received and the connection is established passively for the listener

The connect-* probes relate to the initiating or active side of the TCP connection, while the accept-* probes are for the listening or passive side of the connection.

Connection latency

The tcpconnlat.d script measures the latency incurred for the TCP three-way handshake to complete for outbound connections. It does this using a quantized aggregation - in other words, connection latencies are grouped into power-of-two buckets: n to 2n, 2n to 4n, and so on. Below are the results of running the script in one window while running a number of wgets to various websites:

# dtrace -s /usr/demo/dtrace/tcpconnlat.d
dtrace: script '/usr/demo/dtrace/tcpconnlat.d' matched 2 probes
^C

  Connect Latency (ns)                                64.210.72.58                                      
           value  ------------- Distribution ------------- count    
        16777216 |                                         0        
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
        67108864 |                                         0        

  Connect Latency (ns)                                74.125.45.103                                     
           value  ------------- Distribution ------------- count    
        16777216 |                                         0        
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
        67108864 |                                         0        

  Connect Latency (ns)                                192.9.164.55                                      
           value  ------------- Distribution ------------- count    
        33554432 |                                         0        
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
       134217728 |                                         0        

  Connect Latency (ns)                                192.9.164.72                                      
           value  ------------- Distribution ------------- count    
        33554432 |                                         0        
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
       134217728 |                                         0        

The script works by measuring the time between the firing of the tcp:::connect-request probe (when the initial SYN of the three-way handshake is sent) and the firing of tcp:::connect-established (when the final ACK is sent, indicating the connection has been established). The script uses the unique-per-connection connection id (cs_cid) to record the starting timestamp for comparison.

tcp:::connect-request
{
        start[args[1]->cs_cid] = timestamp;
}

tcp:::connect-established
/start[args[1]->cs_cid] /
{
        @latency["Connect Latency (ns)", args[3]->tcps_raddr] =
            quantize(timestamp - start[args[1]->cs_cid]);
        start[args[1]->cs_cid] = 0;
}
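The power-of-two bucketing that quantize() performs can be sketched in a few lines of Python (a simplified model covering positive values only; the real quantize() also has zero and negative buckets):

```python
# Sketch of DTrace quantize() bucketing for positive values: each value
# is counted in the bucket [2^k, 2^(k+1)).
def quantize_bucket(value):
    """Return the lower bound of the power-of-two bucket containing value."""
    bucket = 1
    while bucket * 2 <= value:
        bucket *= 2
    return bucket

# A 40 ms (40,000,000 ns) connect latency lands in the 33554432 bucket,
# matching the rows shown in the tcpconnlat.d output above.
assert quantize_bucket(40_000_000) == 33554432
```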

First-byte latency

First-byte latency is the time from connection establishment to when the first data is received. The script, tcp1stbyte.d, is geared towards the initiating side of the connection (since it times from tcp:::connect-established to tcp:::receive firing), but could be rewritten to be more general by replacing "tcp:::connect-established {}" with "tcp:::state-change / args[3]->tcps_state == TCP_STATE_ESTABLISHED / {}".

# dtrace -s /usr/demo/dtrace/tcp1stbyte.d
dtrace: script '/usr/demo/dtrace/tcp1stbyte.d' matched 16 probes
^C

  1st Byte Latency (ns)                               89.207.56.140                                     
           value  ------------- Distribution ------------- count    
        16777216 |                                         0        
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
        67108864 |                                         0        

  1st Byte Latency (ns)                               66.102.9.99                                       
           value  ------------- Distribution ------------- count    
        33554432 |                                         0        
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
       134217728 |                                         0        

Again, the script is straightforward:

tcp:::connect-established
{
        start[args[1]->cs_cid] = timestamp;
}

tcp:::receive
/start[args[1]->cs_cid] && (args[2]->ip_plength - args[4]->tcp_offset) > 0/
{
        @latency["1st Byte Latency (ns)", args[2]->ip_saddr] =
            quantize(timestamp - start[args[1]->cs_cid]);
        start[args[1]->cs_cid] = 0;
}
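A similar measurement can be approximated from userland. Here's a rough Python sketch that times connection establishment to first byte against a throwaway local server; unlike the DTrace script, it includes application and scheduler overhead, so treat the numbers as an upper bound:

```python
# Userland analogue of tcp1stbyte.d: time from connection establishment
# to the arrival of the first byte of data, using a local test server.
import socket
import threading
import time

def serve_one(listener):
    # Accept a single connection and send a few bytes, then close.
    conn, _ = listener.accept()
    conn.sendall(b"hello")
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # ephemeral port
listener.listen(1)
threading.Thread(target=serve_one, args=(listener,), daemon=True).start()

sock = socket.create_connection(listener.getsockname())
established = time.monotonic()
sock.recv(1)                      # block until the first byte arrives
first_byte_latency = time.monotonic() - established
sock.close()
listener.close()
print(f"1st byte latency: {first_byte_latency * 1e9:.0f} ns")
```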

Round-trip latency

Here's a script not included in the demo scripts; even though there are edge cases it probably won't cover, it's a useful demonstration of some additional features of the TCP provider. It measures average round-trip time from sending to acknowledgement of TCP segments on a per-host basis. Here's the output:

# dtrace -s tcp_rtt.d
^C
Remote host               TCP Avg RTT(ns) StdDev          NumBytes   NumSegs   
127.0.0.1                 255635          162214          0          4         
::1                       2718049         3127457         0          4         
66.102.9.99               25552916        27326796        230        12        
89.207.56.140             28347521        57559039        110        28        
74.125.105.87             30258841        0               0          1         

We see that localhost RTTs are smallest, with IPv4 lower than IPv6, and remote hosts higher still. It is interesting that the IPv6 localhost RTT is so much greater than IPv4's. The script itself is a bit more complex:

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
{
	snxt[args[1]->cs_cid, args[3]->tcps_snxt] = timestamp;
	@bytesrtt[args[2]->ip_daddr] =
	    sum(args[2]->ip_plength - args[4]->tcp_offset);
	@countrtt[args[2]->ip_daddr] = count();
}

tcp:::receive
/ snxt[args[1]->cs_cid, args[4]->tcp_ack] /
{
	@meanrtt[args[2]->ip_saddr] = 
	    avg(timestamp - snxt[args[1]->cs_cid, args[4]->tcp_ack]);
	@stddevrtt[args[2]->ip_saddr] =
	    stddev(timestamp - snxt[args[1]->cs_cid, args[4]->tcp_ack]);
	snxt[args[1]->cs_cid, args[4]->tcp_ack] = 0;
}

END
{
	printf("%-25s %-15s %-15s %-10s %-10s\n",
	    "Remote host", "TCP Avg RTT(ns)", "StdDev", "NumBytes", "NumSegs");
	printa("%-25s %@-15d %@-15d %@-10d %@-10d\n",
	    @meanrtt, @stddevrtt, @bytesrtt, @countrtt);
}

We use an associative array, snxt, which stores timestamps keyed by connection and by the next sequence number that will be sent. The tcpsinfo value tcps_snxt represents the next sequence number to be used for sending, so at a tcp:::send probe point it is the acknowledgement number we would expect to receive for that segment. Recall that TCP acknowledges a segment with the next sequence number it expects to receive, so when this segment is acknowledged, we will most probably receive a segment specifying that sequence number as its acknowledgement number. When we receive a segment for that connection _with_ that sequence number as its acknowledgement number, we can use the original send timestamp to determine the round-trip acknowledgement time. This is then averaged across all connections for the remote host to get a mean acknowledgement RTT. Note that this scheme breaks down for selective acknowledgement (among other cases), but it is a neat demonstration of the kind of measurement that is possible, and an easy way to find slow peer systems. Suspiciously high round-trip times would merit further investigation, especially where machines are on the same network segment and should therefore have broadly similar latencies. Note too that lumping all ports together (as this script does) leads to high standard deviations in the average round-trip times, so a drilldown by port would likely be merited.
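The failure mode under cumulative acknowledgement is easy to demonstrate outside DTrace. This small Python model of the snxt[] matching logic (plain integers standing in for connection ids, sequence numbers and timestamps) shows a delayed ACK stranding a stored timestamp:

```python
# Why exact matching on tcps_snxt breaks down: a single cumulative ACK
# that covers several segments matches at most one stored entry, and
# the other send timestamps are never reclaimed.
snxt = {}  # (connection id, expected ack number) -> send timestamp

def on_send(cid, next_seq, now):
    snxt[(cid, next_seq)] = now

def on_receive(cid, ack, now):
    if (cid, ack) in snxt:
        return now - snxt.pop((cid, ack))
    return None  # delayed/cumulative ACK: no exact match, sample lost

on_send(1, 100, 0)
on_send(1, 200, 10)
rtt = on_receive(1, 200, 50)  # cumulatively ACKs both, matches only the second
assert rtt == 40
assert (1, 100) in snxt       # the first entry is stranded forever
```

This stranding is exactly what the later per-connection unacked[] approach (in the Jun 25 post above) was written to avoid.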
