Friday Jun 25, 2010

Measuring RTT using DTrace while dealing realistically with TCP's acknowledgement model

In a previous entry, I mentioned that the TCP round-trip time latency script I demonstrated there had some shortcomings. It's time to be a bit more specific as we dig a bit deeper into TCP.

The most glaring problem is the over-simplistic assumption it makes about TCP's acknowledgement model. Recall we traced tcp:::send events, noting what the next sequence number after that send event would be. For transmissions (as opposed to retransmissions), this corresponds the value stored in tcps_snxt in the tcpsinfo_t. We assumed that we would always get an acknowledgement specifying that sequence number and that there is a 1:1 relationship between segments and ACKs with each ACK acknowledging a corresponding segment. We used that assumption to match up ACKs to the original send times, and thus determined round-trip time. This assumption about TCP behaviour is not valid, but it works enough of the time for us to be able to get some RTT measurements. However, it is clearly both unreliable and wasteful to fill aggregations with timestamps for each segment and only use a subset - if any - for RTT measurements.

In reality, when TCP acknowledges the sequence number N, it is also acknowledging all data up to sequence number N. One example of where this is used is delayed acknowledgement. On receiving a segment, the receiving TCP does not always immediately acknowledge it - often delaying for a period on the grounds there may be data to send in response, rather then just sending an ACK segment. As long as no other data is received, TCP can delay before sending the ACK, but if more data is then received, an ACK covering multiple segments may be sent. We need to fix our RTT script to deal with this fact. The best way to do this is to store the first sequence number we encounter on a per-connection-id basis, along with the associated send timestamp. Then when we receive a segment, we check if it's ACK is greater than the stored send sequence number for that connection, we know it is ACKing at least that segment. This represents one RTT sample, and once this sample has been collected we are free to store another send sequence number for a subsequent send event. This deals with both the simplistic 1:1 segment:ack assumption of the previous script, and the wasteful use of aggregation space. Using this approach, only 1 aggregation value is used per connection at any one time instead of using one aggregation value per sent segment. The old approach had the potential to fill up aggregations with data as TCP sent segment after segment before receiving an ACK - the sliding window behaviour we discussed earlier.

So let's see the script:

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::send
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 && unacked[args[1]->cs_cid] == 0 /
{
	seq[args[1]->cs_cid] = args[4]->tcp_seq;
	start[args[1]->cs_cid] = timestamp;
	unacked[args[1]->cs_cid] = 1;
}


tcp:::receive
/ unacked[args[1]->cs_cid] == 1 && args[4]->tcp_ack > seq[args[1]->cs_cid] /
{
	@meanrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     avg(timestamp - start[args[1]->cs_cid]);
	@stddevrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     stddev(timestamp - start[args[1]->cs_cid]);
	@minrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     min(timestamp - start[args[1]->cs_cid]);
	@maxrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     max(timestamp - start[args[1]->cs_cid]);
	@countrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     count(); 
	unacked[args[1]->cs_cid] = 0;
}

END
{
	printf("%-20s %-6s %-10s %-10s %-10s %-10s %-8s\\n",
            "Remote host", "Port", "AvgRTT(ns)", "StdDev", "Min", "Max", "#Samples");
	printa("%-20s %-6d %@-10d %@-10d %@-10d %@-10d %@-8d\\n", @meanrtt, @stddevrtt, 
	    @minrtt, @maxrtt, @countrtt);
}

We record average, standard deviation, min and max RTT for non-control segments on a per-connection basis. At any one time, a connection can have at most one sequence number stored in the unacked[] associative array, and when this is acked, we record the RTT and reset unacked[connection-id] to 0 to allow for another sample. The upshot is we can only record one RTT at a time. Here's some output showing RTTs for a HTTP load:

# dtrace -s tcp_rtt4.d
\^C
Remote host          Port   AvgRTT(ns) StdDev     Min        Max        #Samples
82.195.132.153       80     86295028   0          86295028   86295028   1       
64.236.124.228       80     234437989  178807729  57704050   500949801  4       

We can see that there is an almost 10x difference between minimum and maximum RTT measurements for 64.236.124.228, ranging from 57.7msec to 500.9msec.

We've seen how long it takes from TCP segment transmission to receipt of acknowledgement. Now let's probe how long TCP takes from receiving a segment to sending an acknowledgement - the acknowledgement latency. As noted above, often TCP delays acknowledgement, but the delayed acknowledgement scheme can be switched off using the TCP_NODELAY socket option. By running apache on a local system and measuring time between segment receive and acknowledgement, and comparing this to the same measurement for the requesting TCP connection (from the wget process), we can see that apache does not appear to use delayed acknowledgement, so probably specifies the TCP_NODELAY socket option (subsequent checking indeed confirms this). By contrast the long delays associated with the localhost request (from a wget(1) command) suggest that wget does not use TCP_NODELAY. The following output shows results of "wget http://127.0.0.1", and we can see that the ACK times for the port 80 side of the connection is significantly faster than for the anon wget-requesting port (61670):

# dtrace -s tcp_ack.d           
\^C
Remote host          Port   AvgACK(ns) StdDev     Min        Max        #Samples
127.0.0.1            80     16352199   21798743   230152     47169266   3       
127.0.0.1            61670  1640239644 0          1640239644 1640239644 1       

Here's the script that measures ACK latency:

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::receive
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 &&
  unacked[args[1]->cs_cid] == 0 /
{
 	seq[args[1]->cs_cid] = args[4]->tcp_seq;
	start[args[1]->cs_cid] = timestamp;
	unacked[args[1]->cs_cid] = 1;
}


tcp:::send
/ unacked[args[1]->cs_cid] == 1 && args[4]->tcp_ack > seq[args[1]->cs_cid] /
{
 	@meanack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             avg(timestamp - start[args[1]->cs_cid]);
	@stddevack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             stddev(timestamp - start[args[1]->cs_cid]);
	@minack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             min(timestamp - start[args[1]->cs_cid]);
	@maxack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             max(timestamp - start[args[1]->cs_cid]);
	@countack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             count();
	unacked[args[1]->cs_cid] = 0;
}

END
{
 	printf("%-20s %-6s %-10s %-10s %-10s %-10s %-8s\\n",
            "Remote host", "Port", "AvgACK(ns)", "StdDev", "Min", "Max",
            "#Samples");
	printa("%-20s %-6d %@-10d %@-10d %@-10d %@-10d %@-8d\\n", @meanack,
            @stddevack, @minack, @maxack, @countack);
}
About

user12820842

Search

Archives
« June 2010 »
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
21
24
26
27
28
29
30
   
       
Today