Wednesday Nov 16, 2011

Exploring TCP throughput with DTrace (2)

Last time, I described how we can use the overlap in distributions of unacknowledged byte counts and send window to determine whether the peer's receive window may be too small, limiting throughput. Let's combine that comparison with a comparison of congestion window and slow start threshold, all on a per-port/per-client basis. This will help us

  • Identify whether the congestion window or the receive window is the limiting factor on throughput by comparing the distributions of congestion window and send window values to the distribution of outstanding (unacked) bytes. This will give us a visual sense of how often we are thwarted in our attempts to fill the pipe by congestion control versus the peer not being able to receive any more data.
  • Identify whether slow start or congestion avoidance predominate by comparing the overlap in the congestion window and slow start distributions. If the slow start threshold distribution overlaps with the congestion window, we know that we have switched between slow start and congestion avoidance, possibly multiple times.
  • Identify whether the peer's receive window is too small by comparing the distribution of outstanding unacked bytes with the send window distribution (i.e. the peer's receive window). I discussed this here.

# dtrace -s tcp_window.d
dtrace: script 'tcp_window.d' matched 10 probes
^C

  cwnd                                                  80  10.175.96.92                                      
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |                                         4        
            4096 |                                         6        
            8192 |                                         18       
           16384 |                                         36       
           32768 |@                                        79       
           65536 |@                                        155      
          131072 |@                                        199      
          262144 |@@@                                      400      
          524288 |@@@@@@                                   798      
         1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             3848     
         2097152 |                                         0        

  ssthresh                                              80  10.175.96.92                                      
           value  ------------- Distribution ------------- count    
       268435456 |                                         0        
       536870912 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5543     
      1073741824 |                                         0        

  unacked                                               80  10.175.96.92                                      
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         1        
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         3        
             512 |                                         0         
            1024 |                                         0        
            2048 |                                         4        
            4096 |                                         9        
            8192 |                                         21       
           16384 |                                         36       
           32768 |@                                        78       
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  5391     
          131072 |                                         0        

  swnd                                                  80  10.175.96.92                                      
           value  ------------- Distribution ------------- count    
           32768 |                                         0        
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5543     
          131072 |                                         0        

Here we are observing a large file transfer via http on the webserver. Comparing these distributions, we can observe:

  • That slow start congestion control is in operation. The distribution of congestion window values lies below the range of slow start threshold values (which are in the 536870912+ range), so the connection is in slow start mode.
  • Both the unacked byte count and the send window values peak in the 65536-131071 range, but the send window value distribution is narrower. This tells us that the peer TCP's receive window is not closing.
  • The congestion window distribution peaks in the 1048576 - 2097152 range while the receive window distribution is confined to the 65536-131071 range. Since the cwnd distribution ranges as low as 2048-4095, we can see that for some of the time we have been observing the connection, congestion control has been a limiting factor on transfer, but for the majority of the time the receive window of the peer would more likely have been the limiting factor. However, we know the window has never closed as the distribution of swnd values stays within the 65536-131071 range.
So, all in all, we have a connection that has been mildly constrained by congestion control, but for the bulk of the time we have been observing it, neither congestion nor the peer's receive window has limited throughput. Here's the script:

#!/usr/sbin/dtrace -s


tcp:::send
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        @cwnd["cwnd", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize(args[3]->tcps_cwnd);
        @ssthresh["ssthresh", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize(args[3]->tcps_cwnd_ssthresh);
        @unacked["unacked", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize(args[3]->tcps_snxt - args[3]->tcps_suna);
        @swnd["swnd", args[4]->tcp_sport, args[2]->ip_daddr] =
            quantize((args[4]->tcp_window)*(1 << args[3]->tcps_snd_ws));
}

One surprise here is that slow start is still in operation - one would assume that for a large file transfer, acknowledgements would push the congestion window up past the slow start threshold over time. The slow start threshold is in fact still close to its initial (very high) value, which suggests we have not experienced any congestion (the slow start threshold is adjusted when congestion occurs). Also, the above measurements were taken early in the connection lifetime, so the congestion window did not get a chance to be bumped up to the level of the slow start threshold.
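To check the claim that no congestion has occurred, a small sketch along the following lines could be used - this is my own addition rather than part of the script above, and the ssthresh[] array name is just for illustration. It prints a line whenever a connection's slow start threshold drops, which is what a congestion event would cause:

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Report when ssthresh drops for a connection - a sign that congestion
 * (a retransmission timeout or fast retransmit) has been detected.
 * The first clause compares against the previously recorded value;
 * the second clause then records the current value. */
tcp:::send
/ ssthresh[args[1]->cs_cid] != 0 &&
  args[3]->tcps_cwnd_ssthresh < ssthresh[args[1]->cs_cid] /
{
        printf("%s:%d ssthresh dropped from %d to %d\n",
            args[2]->ip_daddr, args[4]->tcp_dport,
            ssthresh[args[1]->cs_cid], args[3]->tcps_cwnd_ssthresh);
}

tcp:::send
{
        ssthresh[args[1]->cs_cid] = args[3]->tcps_cwnd_ssthresh;
}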

A good strategy when examining these sorts of measurements for a given service (such as a webserver) would be to start by examining the distributions above aggregated by port number only, to get an overall feel for service performance - i.e. is congestion control or peer receive window size an issue, or are we unconstrained and free to fill the pipe? From there, the overlap of distributions will tell us whether to drill down into specific clients. For example, if the send window distribution has multiple peaks, we may want to examine whether particular clients show issues with their receive window. A sketch of such a port-only aggregation follows.
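As a minimal sketch of that first, port-only step - assuming the same probe arguments as tcp_window.d above, and with a hypothetical name of tcp_window_byport.d - the per-client address key is simply dropped from the aggregations:

#!/usr/sbin/dtrace -s

/* Same distributions as tcp_window.d, but keyed on local port only,
 * to give an overall per-service view rather than per-client detail. */
tcp:::send
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        @cwnd["cwnd", args[4]->tcp_sport] = quantize(args[3]->tcps_cwnd);
        @unacked["unacked", args[4]->tcp_sport] =
            quantize(args[3]->tcps_snxt - args[3]->tcps_suna);
        @swnd["swnd", args[4]->tcp_sport] =
            quantize((args[4]->tcp_window)*(1 << args[3]->tcps_snd_ws));
}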

Exploring TCP throughput with DTrace

One key measure to use when assessing TCP throughput is the amount of unacknowledged data in the pipe. In DTrace terms, the amount of unacknowledged data in bytes for the connection is the difference between the next sequence number to send and the lowest unacknowledged sequence number (tcps_snxt - tcps_suna). According to the theory, when the number of unacknowledged bytes for the connection is less than the receive window of the peer, the path bandwidth is the limiting factor for throughput. In other words, if we can fill the pipe without the peer TCP complaining (by virtue of its window size reaching 0), we are purely bandwidth-limited. If the peer's receive window is too small however, the sending TCP has to wait for acknowledgements before it can send more data. In this case the round-trip time (RTT) limits throughput. In such cases the effective throughput limit is the window size divided by the RTT, e.g. if the window size is 64K and the RTT is 0.5sec, the throughput is 128K/s.

So a neat way to visually determine whether the receive window of clients may be too small is to compare the server's distribution of unacknowledged byte counts with the client's advertised receive window. If the unacked distribution overlaps the send window distribution such that it extends to the right (or lower down in DTrace output, since quantizations are displayed vertically), it indicates that the amount of unacknowledged data regularly exceeds the client's receive window, so it is possible that the sender has more data to send but is blocked by a zero window on the client side.

In the following example, we compare the distribution of unacked values to the receive window advertised by the receiver (10.175.96.92) for a large file download via http.

# dtrace -s tcp_tput.d
^C

  unacked(bytes)                                    10.175.96.92                                          80
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         6        
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         3        
             512 |                                         0        
            1024 |                                         0        
            2048 |                                         9        
            4096 |                                         14       
            8192 |                                         27       
           16384 |                                         67       
           32768 |@@                                       1464     
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   32396    
          131072 |                                         0        

  SWND(bytes)                                         10.175.96.92                                          80
           value  ------------- Distribution ------------- count    
           16384 |                                         0        
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 17067    
           65536 |                                         0        

Here we have a puzzle. We can see that the receiver's advertised window is in the 32768-65535 range, while the amount of unacknowledged data in the pipe is largely in the 65536-131071 range. What's going on here? Surely in a case like this we should see zero-window events, since the amount of data in the pipe regularly exceeds the window size of the receiver. Yet no zero-window events appear: the SWND distribution displays no 0 values - it stays within the 32768-65535 range.

The explanation is straightforward enough. TCP window scaling is in operation for this connection - the Window Scale TCP option is used on connection setup to allow a connection to advertise (and have advertised to it) a window greater than 65535 bytes. In this case the scaling shift is 1, which explains why the SWND values are clustered in the 32768-65535 range rather than the 65536-131071 range - the SWND value needs to be multiplied by two, since the receiver is scaling its window by a shift factor of 1.

Here's the simple script that compares unacked and SWND distributions, fixed to take account of window scaling.

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        @unacked["unacked(bytes)", args[2]->ip_daddr, args[4]->tcp_sport] =
            quantize(args[3]->tcps_snxt - args[3]->tcps_suna);
}

tcp:::receive
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        @swnd["SWND(bytes)", args[2]->ip_saddr, args[4]->tcp_dport] =
            quantize((args[4]->tcp_window)*(1 << args[3]->tcps_snd_ws));

}

And here's the fixed output.

# dtrace -s tcp_tput_scaled.d
^C
  unacked(bytes)                                     10.175.96.92                                          80
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         39       
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         3        
             512 |                                         0        
            1024 |                                         0        
            2048 |                                         4        
            4096 |                                         9        
            8192 |                                         22       
           16384 |                                         37       
           32768 |@                                        99       
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   3858     
          131072 |                                         0        

  SWND(bytes)                                         10.175.96.92                                          80
           value  ------------- Distribution ------------- count    
             512 |                                         0        
            1024 |                                         1        
            2048 |                                         0        
            4096 |                                         2        
            8192 |                                         4        
           16384 |                                         7        
           32768 |                                         14       
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1956     
          131072 |                                         0        

Monday Nov 14, 2011

Observing flow control idle time in TCP

Previously I described how to observe congestion control strategies during transmission, and here I talked about TCP's sliding window approach for handling flow control on the receive side. A neat trick would now be to put the pieces together and ask the following question - how often is TCP transmission blocked by congestion control (send-side flow control) versus a zero-sized send window (which is the receiver saying it cannot process any more data)? So in effect we are asking whether the size of the receive window of the peer or the congestion control strategy may be sub-optimal. The result of such a problem would be that we have TCP data that we could be transmitting but are not, potentially affecting throughput.

So flow control is in effect:

  • when the congestion window is less than or equal to the number of bytes outstanding on the connection, which we can derive from args[3]->tcps_snxt - args[3]->tcps_suna, i.e. the difference between the next sequence number to send and the lowest unacknowledged sequence number; and
  • when the window in the TCP segment received is advertised as 0
We time from these events until we send new data (i.e. until args[4]->tcp_seq >= the snxt value recorded when the window closed). Here's the script:

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::send
/ (args[3]->tcps_snxt - args[3]->tcps_suna) >= args[3]->tcps_cwnd /
{
        cwndclosed[args[1]->cs_cid] = timestamp;
        cwndsnxt[args[1]->cs_cid] = args[3]->tcps_snxt;
        @numclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] = count();
}

tcp:::send
/ cwndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= cwndsnxt[args[1]->cs_cid] /
{
        @meantimeclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] =
            avg(timestamp - cwndclosed[args[1]->cs_cid]);
        @stddevtimeclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] =
            stddev(timestamp - cwndclosed[args[1]->cs_cid]);
        @numclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] = count();
        cwndclosed[args[1]->cs_cid] = 0;
        cwndsnxt[args[1]->cs_cid] = 0;
}

tcp:::receive
/ args[4]->tcp_window == 0 &&
  (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        swndclosed[args[1]->cs_cid] = timestamp;
        swndsnxt[args[1]->cs_cid] = args[3]->tcps_snxt;
        @numclosed["swnd", args[2]->ip_saddr, args[4]->tcp_dport] = count();
}

tcp:::send
/ swndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= swndsnxt[args[1]->cs_cid] /
{
        @meantimeclosed["swnd", args[2]->ip_daddr, args[4]->tcp_sport] =
            avg(timestamp - swndclosed[args[1]->cs_cid]);
        @stddevtimeclosed["swnd", args[2]->ip_daddr, args[4]->tcp_sport] =
            stddev(timestamp - swndclosed[args[1]->cs_cid]);
        swndclosed[args[1]->cs_cid] = 0;
        swndsnxt[args[1]->cs_cid] = 0;
}


END
{
        printf("%-6s %-20s %-8s %-25s %-8s %-8s\n", "Window",
            "Remote host", "Port", "TCP Avg WndClosed(ns)", "StdDev",
            "Num");
        printa("%-6s %-20s %-8d %@-25d %@-8d %@-8d\n", @meantimeclosed,
            @stddevtimeclosed, @numclosed);
}

So this script will show us whether the peer's receive window size is preventing flow ("swnd" events) or whether congestion control is limiting flow ("cwnd" events). As an example, I traced a server with a large file transfer in progress via a webserver and with an active ssh connection running "find / -depth -print". Here is the output:

^C
Window Remote host          Port     TCP Avg WndClosed(ns)     StdDev   Num     
cwnd   10.175.96.92         80       86064329                  77311705 125     
cwnd   10.175.96.92         22       122068522                 151039669 81  

So we see in this case, the congestion window closes 125 times for port 80 connections and 81 times for ssh. The average time the window is closed is 0.086sec for port 80 and 0.12sec for port 22.

So if you wish to change the congestion control algorithm in Oracle Solaris 11, a useful first step may be to see if congestion really is an issue on your network. Scripts like the one posted above can help assess this, but it's worth reiterating that if congestion control is occurring, that's not necessarily a problem that needs fixing. Recall that congestion control is about controlling flow to prevent large-scale drops, so looking at congestion events in isolation doesn't tell us the whole story. For example, are we seeing more congestion events with one control algorithm, but more drops/retransmissions with another? As always, it's best to start with measures of throughput and latency before arriving at a specific hypothesis such as "my congestion control algorithm is sub-optimal".
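To put congestion events alongside the drops and retransmissions they are meant to prevent, something along the following lines could be used - a rough sketch of my own (the aggregation names are hypothetical) that counts cwnd-limited sends and retransmitted segments per remote host and port, so two algorithms can be compared on both measures:

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Count sends made while the congestion window is the limiting factor. */
tcp:::send
/ (args[3]->tcps_snxt - args[3]->tcps_suna) >= args[3]->tcps_cwnd /
{
        @cwndlimited[args[2]->ip_daddr, args[4]->tcp_dport] = count();
}

/* Count retransmitted segments to the same host/port. */
tcp:::send
/ args[3]->tcps_retransmit == 1 /
{
        @retransmits[args[2]->ip_daddr, args[4]->tcp_dport] = count();
}

END
{
        printf("%-20s %-8s %-12s %-12s\n", "Remote host", "Port",
            "CwndLimited", "Retransmits");
        printa("%-20s %-8d %@-12d %@-12d\n", @cwndlimited, @retransmits);
}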

Thursday Nov 10, 2011

DTracing TCP congestion control

In a previous post, I showed how we can use DTrace to probe TCP receive and send window events. TCP receive and send windows are in effect both about flow-controlling how much data can be received - the receive window reflects how much data the local TCP is prepared to receive, while the send window simply reflects the size of the receive window of the peer TCP. Both then represent flow control as imposed by the receiver.
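Both window values are exposed by the provider's tcpsinfo_t, so as a quick sketch (assuming the tcps_rwnd and tcps_swnd fields documented for the tcp provider) you can eyeball them per local port with a one-liner like this:

# dtrace -n 'tcp:::send { @rwnd["rwnd", args[4]->tcp_sport] = quantize(args[3]->tcps_rwnd); @swnd["swnd", args[4]->tcp_sport] = quantize(args[3]->tcps_swnd); }'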

However, consider what happens without the sender imposing any flow control of its own, over a slow link to a peer: TCP will simply fill up the peer's window with sent segments. Dealing with multiple TCP implementations filling their peer TCP's receive windows in this manner, busy intermediate routers may drop some of these segments, leading to timeout and retransmission, which may again lead to drops. This is termed congestion, and TCP has multiple congestion control strategies. We can see that in this example, we need to have some way of adjusting how much data we send depending on how quickly we receive acknowledgement - if we get ACKs quickly, we can safely send more segments, but if acknowledgements come slowly, we should proceed with more caution. More generally, we need to implement flow control on the send side also.

Slow Start and Congestion Avoidance

From RFC2581, let's examine the relevant variables:

"The congestion window (cwnd) is a sender-side limit on the amount of data the sender can transmit into the network before receiving an acknowledgment (ACK). Another state variable, the slow start threshold (ssthresh), is used to determine whether the slow start or congestion avoidance algorithm is used to control data transmission"

Slow start is used to probe the network's ability to handle transmission bursts both when a connection is first created and when retransmission timers fire. The latter case is important, as the fact that we have effectively lost TCP data acts as a motivator for re-probing how much data the network can handle from the sending TCP. The congestion window (cwnd) is initialized to a relatively small value, generally a low multiple of the sending maximum segment size. When slow start kicks in, we will only send that number of bytes before waiting for acknowledgement. When acknowledgements are received, the congestion window is increased in size until cwnd reaches the slow start threshold ssthresh value. For most congestion control algorithms the window increases exponentially under slow start, assuming we receive acknowledgements. We send 1 segment, receive an ACK, increase the cwnd by 1 MSS to 2*MSS, send 2 segments, receive 2 ACKs, increase the cwnd by 2*MSS to 4*MSS, send 4 segments etc. When the congestion window exceeds the slow start threshold, congestion avoidance is used instead of slow start.

During congestion avoidance, the congestion window is generally updated by one MSS for each round-trip time as opposed to each ACK, so cwnd growth is linear instead of exponential (we may receive multiple ACKs within a single RTT). This continues until congestion is detected. If a retransmit timer fires, congestion is assumed and the ssthresh value is reset to a fraction of the number of bytes outstanding (unacknowledged) in the network. At the same time the congestion window is reset to a single maximum segment size. Thus, we initiate slow start until we start receiving acknowledgements again, at which point we can eventually flip over to congestion avoidance when cwnd > ssthresh.

Congestion control algorithms differ most in how they handle the other indication of congestion - duplicate ACKs. A duplicate ACK is a strong indication that data has been lost, since it often comes from a receiver explicitly asking for a retransmission. In some cases, a duplicate ACK may be generated at the receiver as a result of packets arriving out of order, so it is sensible to wait for multiple duplicate ACKs before assuming packet loss rather than out-of-order delivery; retransmitting at that point, without waiting for the retransmission timer to expire, is termed fast retransmit. Note that on Oracle Solaris 11, the congestion control method used can be customized. See here for more details. In general, 3 or more duplicate ACKs indicate packet loss and should trigger fast retransmit. It's best not to revert to slow start in this case, as the fact that the receiver knew it was missing data suggests it has received data with a higher sequence number, so we know traffic is still flowing. Falling back to slow start would therefore be excessive, so fast recovery is used instead.
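As a rough illustration of spotting duplicate ACKs with the provider - my own sketch, not one of the scripts in this series, and the lastack[] array name is just for illustration - we can remember the last ACK number seen per connection and count payload-free segments that repeat it:

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Count segments that carry no payload and repeat the previous ACK number -
 * a rough proxy for duplicate ACKs arriving from the peer (pure window
 * updates can also match, so treat the counts as approximate). */
tcp:::receive
/ (args[2]->ip_plength - args[4]->tcp_offset) == 0 &&
  (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 &&
  args[4]->tcp_ack == lastack[args[1]->cs_cid] /
{
        @dupacks[args[2]->ip_saddr, args[4]->tcp_sport] = count();
}

/* Record the most recent ACK number for the connection; this clause runs
 * after the one above, so the comparison there sees the previous value. */
tcp:::receive
{
        lastack[args[1]->cs_cid] = args[4]->tcp_ack;
}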

Observing slow start and congestion avoidance

The following script counts TCP segments sent when under slow start (cwnd <= ssthresh) versus congestion avoidance (cwnd > ssthresh).

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::connect-request
/ start[args[1]->cs_cid] == 0/
{
        start[args[1]->cs_cid] = 1;
}
tcp:::send
/ start[args[1]->cs_cid] == 1 &&
  args[3]->tcps_cwnd <= args[3]->tcps_cwnd_ssthresh /
{
        @c["Slow start", args[2]->ip_daddr, args[4]->tcp_dport] = count();
}

tcp:::send
/ start[args[1]->cs_cid] == 1 &&
  args[3]->tcps_cwnd > args[3]->tcps_cwnd_ssthresh /
{
        @c["Congestion avoidance", args[2]->ip_daddr, args[4]->tcp_dport] =
            count();
}

As we can see, the script only works on connections initiated after it starts, using the start[] associative array (indexed by connection ID) to mark new connections with start[cid] = 1. From there we simply differentiate send events where cwnd <= ssthresh (slow start) from those where cwnd > ssthresh (congestion avoidance). Here's the output taken when I accessed a YouTube video (where rport is 80) and from an FTP session where I put a large file onto a remote system.

# dtrace -s tcp_slow_start.d
^C
ALGORITHM                 RADDR                     RPORT      #SEG
Slow start                10.153.125.222               20         6
Slow start                138.3.237.7                  80        14
Slow start                10.153.125.222               21        18
Congestion avoidance      10.153.125.222               20      1164

We see that in the case of the YouTube video, slow start was exclusively used. Most of the segments we sent in that case were likely ACKs. Compare this case - where 14 segments were sent using slow start - to the FTP case, where only 6 segments were sent before we switched to congestion avoidance for 1164 segments. In the case of the FTP session, the FTP data on port 20 was predominantly sent with congestion avoidance in operation, while the FTP control connection on port 21 relied exclusively on slow start.

For the default congestion control algorithm on Solaris 11 - "newreno" - the cwnd is increased by 1 MSS for every acknowledgement received during slow start, and by 1 MSS for each RTT in congestion avoidance mode. Different pluggable congestion control algorithms operate slightly differently. For example, "highspeed" will update the slow start cwnd by the number of bytes ACKed rather than by the MSS.
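To watch that growth directly, one possible one-liner along these lines (my own, with port 80 chosen purely as an example) prints the congestion window and slow start threshold on each send, so the per-ACK versus per-RTT growth pattern can be seen for a given connection:

# dtrace -n 'tcp:::send / args[4]->tcp_dport == 80 / { printf(" %s cwnd=%d ssthresh=%d", args[2]->ip_daddr, args[3]->tcps_cwnd, args[3]->tcps_cwnd_ssthresh); }'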

And to finish, here's a neat oneliner to visually display the distribution of congestion window values for all TCP connections, aggregated by remote port, using a quantization. In this example, only port 80 is in use and we see the majority of cwnd values for that port are in the 4096-8191 range.

# dtrace -n 'tcp:::send { @q[args[4]->tcp_dport] = quantize(args[3]->tcps_cwnd); }'
dtrace: description 'tcp:::send ' matched 10 probes
^C

    80
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@                                   5        
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         0        
             512 |                                         0        
            1024 |                                         0        
            2048 |@@@@@@@@@                                8        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@               23       
            8192 |                                         0        

Friday Jun 25, 2010

Measuring RTT using DTrace while dealing realistically with TCP's acknowledgement model

In a previous entry, I mentioned that the TCP round-trip time latency script I demonstrated there had some shortcomings. It's time to be a bit more specific as we dig a bit deeper into TCP.

The most glaring problem is the over-simplistic assumption it makes about TCP's acknowledgement model. Recall we traced tcp:::send events, noting what the next sequence number after that send event would be. For transmissions (as opposed to retransmissions), this corresponds to the value stored in tcps_snxt in the tcpsinfo_t. We assumed that we would always get an acknowledgement specifying that sequence number and that there is a 1:1 relationship between segments and ACKs, with each ACK acknowledging a corresponding segment. We used that assumption to match up ACKs to the original send times, and thus determined round-trip time. This assumption about TCP behaviour is not valid, but it works enough of the time for us to be able to get some RTT measurements. However, it is clearly both unreliable and wasteful to fill aggregations with timestamps for each segment and only use a subset - if any - for RTT measurements.

In reality, when TCP acknowledges sequence number N, it is also acknowledging all data up to sequence number N. One example of where this is used is delayed acknowledgement. On receiving a segment, the receiving TCP does not always immediately acknowledge it - often delaying for a period on the grounds there may be data to send in response, rather than just sending an ACK segment. As long as no other data is received, TCP can delay before sending the ACK, but if more data is then received, an ACK covering multiple segments may be sent. We need to fix our RTT script to deal with this fact. The best way to do this is to store the first sequence number we encounter on a per-connection-id basis, along with the associated send timestamp. Then, when we receive a segment, if its ACK number is greater than the stored send sequence number for that connection, we know it is acknowledging at least that segment. This represents one RTT sample, and once this sample has been collected we are free to store another send sequence number for a subsequent send event. This deals with both the simplistic 1:1 segment:ACK assumption of the previous script and the wasteful use of aggregation space. Using this approach, only 1 aggregation value is used per connection at any one time instead of one aggregation value per sent segment. The old approach had the potential to fill up aggregations with data as TCP sent segment after segment before receiving an ACK - the sliding window behaviour we discussed earlier.

So let's see the script:

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::send
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 && unacked[args[1]->cs_cid] == 0 /
{
	seq[args[1]->cs_cid] = args[4]->tcp_seq;
	start[args[1]->cs_cid] = timestamp;
	unacked[args[1]->cs_cid] = 1;
}


tcp:::receive
/ unacked[args[1]->cs_cid] == 1 && args[4]->tcp_ack > seq[args[1]->cs_cid] /
{
	@meanrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     avg(timestamp - start[args[1]->cs_cid]);
	@stddevrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     stddev(timestamp - start[args[1]->cs_cid]);
	@minrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     min(timestamp - start[args[1]->cs_cid]);
	@maxrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     max(timestamp - start[args[1]->cs_cid]);
	@countrtt[args[2]->ip_saddr, args[4]->tcp_sport, args[1]->cs_cid] = 
	     count(); 
	unacked[args[1]->cs_cid] = 0;
}

END
{
	printf("%-20s %-6s %-10s %-10s %-10s %-10s %-8s\n",
            "Remote host", "Port", "AvgRTT(ns)", "StdDev", "Min", "Max", "#Samples");
	printa("%-20s %-6d %@-10d %@-10d %@-10d %@-10d %@-8d\n", @meanrtt, @stddevrtt,
	    @minrtt, @maxrtt, @countrtt);
}

We record average, standard deviation, min and max RTT for non-control segments on a per-connection basis. At any one time, a connection can have at most one outstanding sample: its send sequence number and timestamp are stored and the unacked[] flag for the connection is set. When that data is acked, we record the RTT and reset unacked[connection-id] to 0 to allow another sample to be taken. The upshot is we can only record one RTT at a time per connection. Here's some output showing RTTs for an HTTP load:

# dtrace -s tcp_rtt4.d
^C
Remote host          Port   AvgRTT(ns) StdDev     Min        Max        #Samples
82.195.132.153       80     86295028   0          86295028   86295028   1       
64.236.124.228       80     234437989  178807729  57704050   500949801  4       

We can see that there is an almost 10x difference between minimum and maximum RTT measurements for 64.236.124.228, ranging from 57.7msec to 500.9msec.

We've seen how long it takes from TCP segment transmission to receipt of acknowledgement. Now let's probe how long TCP takes from receiving a segment to sending an acknowledgement - the acknowledgement latency. As noted above, TCP often delays acknowledgement, but the delayed acknowledgement scheme can be switched off using the TCP_NODELAY socket option. By running apache on a local system and measuring the time between segment receive and acknowledgement, and comparing this to the same measurement for the requesting TCP connection (from the wget process), we can see that apache does not appear to use delayed acknowledgement, so probably specifies the TCP_NODELAY socket option (subsequent checking indeed confirms this). By contrast, the long delays associated with the localhost request (from a wget(1) command) suggest that wget does not use TCP_NODELAY. The following output shows results of "wget http://127.0.0.1", and we can see that the ACK times for the port 80 side of the connection are significantly faster than for the anon wget-requesting port (61670):

# dtrace -s tcp_ack.d           
^C
Remote host          Port   AvgACK(ns) StdDev     Min        Max        #Samples
127.0.0.1            80     16352199   21798743   230152     47169266   3       
127.0.0.1            61670  1640239644 0          1640239644 1640239644 1       

Here's the script that measures ACK latency:

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::receive
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 &&
  unacked[args[1]->cs_cid] == 0 /
{
 	seq[args[1]->cs_cid] = args[4]->tcp_seq;
	start[args[1]->cs_cid] = timestamp;
	unacked[args[1]->cs_cid] = 1;
}


tcp:::send
/ unacked[args[1]->cs_cid] == 1 && args[4]->tcp_ack > seq[args[1]->cs_cid] /
{
 	@meanack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             avg(timestamp - start[args[1]->cs_cid]);
	@stddevack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             stddev(timestamp - start[args[1]->cs_cid]);
	@minack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             min(timestamp - start[args[1]->cs_cid]);
	@maxack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             max(timestamp - start[args[1]->cs_cid]);
	@countack[args[2]->ip_daddr, args[4]->tcp_dport, args[1]->cs_cid] =
             count();
	unacked[args[1]->cs_cid] = 0;
}

END
{
 	printf("%-20s %-6s %-10s %-10s %-10s %-10s %-8s\n",
            "Remote host", "Port", "AvgACK(ns)", "StdDev", "Min", "Max",
            "#Samples");
	printa("%-20s %-6d %@-10d %@-10d %@-10d %@-10d %@-8d\n", @meanack,
            @stddevack, @minack, @maxack, @countack);
}

Wednesday Jun 23, 2010

DTrace tcp provider and TCP segment sizes

At CEC2006, Brendan and Bryan showed how we can use a neat feature of DTrace in conjunction with the DTrace tcp provider - we can represent byte sizes sent and received using a quantized aggregation, which visually represents the number of values that fall within ranges n to 2n, 2n to 4n, etc. This is particularly useful for TCP, since for peak throughput we would expect a bimodal distribution with one spike at 0 (for control segments with no payload) and another in the interval containing the maximum segment size (MSS) for the TCP connection. Viewing send and receive quantized aggregations (aggregated by remote host IP address) in conjunction with the mean MSS for connections to the remote host, we do indeed see this pattern on the receive side for 199.222.69.17 in the case below, but obviously the results will vary depending on the TCP connection load characteristics. In this case, the load consisted of an HTTP request/response, the former of which will often fall below the MSS in size (we see 1 TCP send in the 512-1023 range), while the received segments are nearly all (32) in the MSS range (1024-2047). For localhost (127.0.0.1, ::1), we see a number of control segments of 0 size only.


# dtrace -s tcp_payloadsize.d
^C
MSS (bytes)          199.222.69.17            1372                
MSS (bytes)          127.0.0.1                4116                
MSS (bytes)          ::1                      4126                

  Sent segment size (bytes)                           127.0.0.1                                         
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18       
               1 |                                         0        

  Sent segment size (bytes)                           ::1                                               
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18       
               1 |                                         0        

  Sent segment size (bytes)                           199.222.69.17                                     
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   21       
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         0        
             512 |@@                                       1        
            1024 |                                         0        

  Received segment size (bytes)                       127.0.0.1                                         
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18       
               1 |                                         0        

  Received segment size (bytes)                       ::1                                               
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18       
               1 |                                         0        

  Received segment size (bytes)                       199.222.69.17                                     
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@                                       2        
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |@                                        1        
             512 |                                         0        
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    32       
            2048 |                                         0        

Here is the script that gathered these results. The MSS is averaged as it can change over time - more on that later.

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
{
 	@transmit["Sent segment size (bytes)", args[2]->ip_daddr] =
            quantize(args[2]->ip_plength - args[4]->tcp_offset);
	@mss["MSS (bytes)", args[2]->ip_daddr] =
            avg(args[3]->tcps_mss);
}

tcp:::receive
{
 	@receive["Received segment size (bytes)", args[2]->ip_saddr] =
            quantize(args[2]->ip_plength - args[4]->tcp_offset);
	@mss["MSS (bytes)", args[2]->ip_saddr] =
            avg(args[3]->tcps_mss);
}

END
{
        printa("%-20s %-24s %@-20d\n", @mss);
}

Monitoring TCP retransmission using the DTrace tcp provider

Thus far, I've tried to cover how we probe some of the basic features of TCP using the DTrace TCP provider, touching on connection establishment (measuring connection and first-byte latency), TCP's acknowledgement scheme (measuring acknowledgment latency) and TCP's sliding window model (measuring zero window idle time). Before moving on to more advanced topics, I'd like to look at another basic aspect of TCP's acknowledgement-driven reliability scheme - retransmission.

When TCP sends a segment, a retransmission timer is started, and if no acknowledgement is received for that segment after the retransmission timer has expired, it is assumed lost and will be retransmitted. Such retransmission events are easy to pick out using the DTrace provider - the tcps_retransmit field of args[3] (the tcpsinfo_t representing TCP state associated with the connection) will be 1 for tcp:::send events corresponding to retransmission. The following script monitors how many payload (i.e. non-header) bytes are transmitted and retransmitted for each remote host/port:

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
/ (args[2]->ip_plength - args[4]->tcp_offset) > 0 / 
{
	@transmit[args[2]->ip_daddr, args[4]->tcp_dport] =
	    sum(args[2]->ip_plength - args[4]->tcp_offset);
}

tcp:::send
/ (args[2]->ip_plength - args[4]->tcp_offset) > 0 &&
  args[3]->tcps_retransmit == 1/
{
	@retransmit[args[2]->ip_daddr, args[4]->tcp_dport] =
	    sum(args[2]->ip_plength - args[4]->tcp_offset);
}

END
{
	printf("%-25s %-15s %-15s %-15s\n",
	    "Remote host", "Port", "BytesSent", "BytesResent");
	printa("%-25s %-15d %@-15d %@-15d\n",
	    @transmit, @retransmit);
}

And here's some sample output showing retransmission events for 212.147.135.190/80 and 87.248.210.254/80 - 1042 bytes out of 4189 and 579 bytes out of 5811 were retransmitted respectively. When the ratio of retransmitted to transmitted bytes is high, further investigation is merited.

#  dtrace -s tcp_retransmit.d
^C
Remote host               Port            BytesSent       BytesResent    
66.102.9.149              80              451             0              
65.242.27.32              80              837             0              
207.123.61.126            80              2558            0              
159.134.196.136           80              2573            0              
216.34.207.180            80              2852            0              
212.147.135.190           80              4189            1042           
87.248.210.254            80              5811            579            
89.207.56.140             80              5922            0              
66.102.9.144              80              9174            0              
82.199.80.141             80              14626           0              

DTrace TCP provider and TCP's sliding window

In the last post, I mentioned I'd demonstrate how we can probe TCP sliding window behaviour. When examining performance problems relating to sliding window on a system, the key thing to determine is if we are encountering situations where either:

  • the send window of any TCP connections is zero. This indicates we can no longer send data since the peer connection is advertising a zero-sized window.
  • the receive window of any TCP connections is zero. This indicates we can no longer receive data since we are advertising a zero-sized window.

If either of these situations occurs regularly, we will then need to dig down and find out why the send or receive windows are filling up, but the first step is to determine if these events are occurring, and if so, how long we are stuck unable to send or receive new data. We can do this by timing the interval between sending a zero-window advertisement and receiving new data (for the receive window), and between receiving a zero-window advertisement and sending new data (for the send window). Here is a script which averages zero receive window idle time by remote address/port. This is the average time between when we advertise a zero window and when we resume receiving new data.

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::send
/ args[4]->tcp_window == 0 && (args[4]->tcp_flags & TH_RST) == 0 /
{
        rwndclosed[args[1]->cs_cid] = timestamp;
	rwndrnxt[args[1]->cs_cid] = args[3]->tcps_rnxt;
        @numrwndclosed[args[2]->ip_daddr, args[4]->tcp_dport] = count();
}

tcp:::receive
/ rwndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= rwndrnxt[args[1]->cs_cid] /
{
        @meantimeclosed[args[2]->ip_saddr, args[4]->tcp_sport] =
            avg(timestamp - rwndclosed[args[1]->cs_cid]);
	@stddevtimeclosed[args[2]->ip_saddr, args[4]->tcp_sport] =
            stddev(timestamp - rwndclosed[args[1]->cs_cid]);
	rwndclosed[args[1]->cs_cid] = 0;
        rwndrnxt[args[1]->cs_cid] = 0;
}

END
{
        printf("%-20s %-8s %-25s %-8s %-8s\n",
            "Remote host", "Port", "TCP Avg RwndClosed(ns)", "StdDev",
            "Num");
        printa("%-20s %-8d %@-25d %@-8d %@-8d\n", @meantimeclosed,
            @stddevtimeclosed, @numrwndclosed);
}

Here's the output showing one receive window close event, triggered by opening a hi-res YouTube video in a browser.

# dtrace -s tcp_rwndclosed.d
^C
Remote host          Port     TCP Avg RwndClosed(ns)    StdDev   Num     
92.122.127.159       80       26914620                  0        1       

We can see we were stuck unable to receive new data for 0.027 seconds since our receive window was 0. Here's the send window equivalent:

#!/usr/sbin/dtrace -s

#pragma D option quiet


tcp:::receive
/ args[4]->tcp_window == 0 && (args[4]->tcp_flags & TH_RST) == 0 /
{
        swndclosed[args[1]->cs_cid] = timestamp;
	swndsnxt[args[1]->cs_cid] = args[3]->tcps_snxt;
	@numswndclosed[args[2]->ip_saddr, args[4]->tcp_sport] = count();
}

tcp:::send
/ swndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= swndsnxt[args[1]->cs_cid] /
{
        @meantimeclosed[args[2]->ip_daddr, args[4]->tcp_dport] =
            avg(timestamp - swndclosed[args[1]->cs_cid]);
	@stddevtimeclosed[args[2]->ip_daddr, args[4]->tcp_dport] =
            stddev(timestamp - swndclosed[args[1]->cs_cid]);
	swndclosed[args[1]->cs_cid] = 0;
	swndsnxt[args[1]->cs_cid] = 0;
}

END
{
        printf("%-20s %-8s %-25s %-8s %-8s\n",
            "Remote host", "Port", "TCP Avg SwndClosed(ns)", "StdDev",
            "Num");
        printa("%-20s %-8d %@-25d %@-8d %@-8d\n", @meantimeclosed,
            @stddevtimeclosed, @numswndclosed);
}

If a lot of zero-window events are occurring, a good next step would be to use the acknowledgement RTT latency measurement script in the previous entry - this will help determine if the ACK RTT is sufficiently high such that the send window fills up before any data is acknowledged.

Tuesday Jun 22, 2010

Improved DTrace tcp acknowledgement RTT latency script

In the previous post, I concluded with a rough script that measures mean acknowledgement latency averaged across remote hosts. Acknowledgement latency, or round-trip time (RTT), is the time spent waiting between when a TCP segment is sent and when an acknowledgement for that data is received. Acknowledgement round-trip time is an important measure as it relates directly to how much data a TCP connection can send - if data is acknowledged quickly, TCP's sliding window can progress. As noted here, for the highest possible throughput, it is important that the transmitter is not forced to stop sending by the sliding window protocol earlier than one round-trip delay time (RTT) - in other words, the window should not fill up before the earliest data is acknowledged, otherwise the sender cannot send any more data until that acknowledgement is received. A number of factors may contribute to these kinds of delays - small advertised windows, slow protocol processing etc. - but the important point is we can now use DTrace both to measure RTT and to examine how the sliding window evolves over time. Here's a better version of the previous script, breaking down RTT by host and port.

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
{
	snxt[args[1]->cs_cid, args[3]->tcps_snxt] = timestamp;
	bytesrttttmp[args[1]->cs_cid, args[3]->tcps_snxt] =
	    args[2]->ip_plength - args[4]->tcp_offset;
}

tcp:::receive
/ snxt[args[1]->cs_cid, args[4]->tcp_ack] /
{
	@bytesrtt[args[2]->ip_saddr, args[4]->tcp_sport] =
	    sum(bytesrttttmp[args[1]->cs_cid, args[4]->tcp_ack]);
	@meanrtt[args[2]->ip_saddr, args[4]->tcp_sport] = 
	    avg(timestamp - snxt[args[1]->cs_cid, args[4]->tcp_ack]);
	@stddevrtt[args[2]->ip_saddr, args[4]->tcp_sport] =
            stddev(timestamp - snxt[args[1]->cs_cid, args[4]->tcp_ack]);
	@countrtt[args[2]->ip_saddr, args[4]->tcp_sport] = count();
	snxt[args[1]->cs_cid, args[4]->tcp_ack] = 0;
	bytesrttttmp[args[1]->cs_cid, args[4]->tcp_ack] = 0;
}

END
{
	printf("%-20s %-8s %-15s %-15s %-10s %-7s\n",
	    "Remote host", "Port", "TCP Avg RTT(ns)", "StdDev", "NumBytes",
	    "NumSegs");
	printa("%-20s %-8d %@-15d %@-15d %@-10d %@-7d\n", @meanrtt,
	    @stddevrtt, @bytesrtt, @countrtt);
}

And here's some sample output:

# dtrace -s tcp_rtt2.d
^C
Remote host          Port     TCP Avg RTT(ns) StdDev          NumBytes   NumSegs
127.0.0.1            9010     109074          79551           0          15     
127.0.0.1            59571    204945          0               0          1      
127.0.0.1            35748    205471          0               0          1      
127.0.0.1            39697    206126          0               0          1      
127.0.0.1            58889    207170          0               0          1      
127.0.0.1            57647    208967          0               0          1      
127.0.0.1            53502    215569          0               0          1      
127.0.0.1            41038    221163          0               0          1      
127.0.0.1            36777    226006          0               0          1      
::1                  9010     233492          186640          0          15     
127.0.0.1            41241    256225          0               0          1      
127.0.0.1            57388    297646          0               0          1      
127.0.0.1            55260    424590          0               0          1      
127.0.0.1            60660    445139          0               0          1      
127.0.0.1            56460    620226          0               0          1      
127.0.0.1            55449    670529          0               0          1      
127.0.0.1            56877    772734          0               0          1      
::1                  62343    1532445         0               0          1      
::1                  51756    1538747         0               0          1      
::1                  41825    1627673         0               0          1      
::1                  56580    1965449         0               0          1      
::1                  65061    2412497         0               0          1      
::1                  50031    2726626         0               0          1      
159.134.196.178      80       2788676         1773030         0          6      
::1                  43043    3465254         0               0          1      
::1                  57685    4454573         0               0          1      
::1                  55854    4490483         0               0          1      
92.123.196.20        80       19855492        0               0          1      
::1                  50972    20481704        0               0          1      
::1                  48172    53565332        0               0          1      
66.235.139.54        80       81219406        80478253        0          2      
::1                  37341    95086544        0               0          1      
::1                  32770    1512858144      0               0          1      
::1                  49710    4197402456      0               0          1      

Next time, we'll look at TCP's sliding window behaviour.

Measuring latencies with the DTrace TCP provider

As I mentioned in the previous entry, the DTrace tcp provider can be used to carry out a number of useful latency measurements. We have included some scripts that help with these measurements, and I'll try to explain them here. Another useful reference is Bryan and Brendan's CEC2006 presentation, though some minor details have changed in the TCP provider arguments since then - consult the updated docs when in doubt.

In the last entry, we used the tcp:::send and tcp:::receive probe points exclusively. There are also a number of probe points relating to TCP connection establishment:

tcp:::connect-request - initial SYN sent as part of three-way handshake to establish active connection
tcp:::connect-refused - initial SYN was refused and we receive a RST|ACK "reset" segment
tcp:::connect-established - we send a final ACK, completing establishment of the active connection
tcp:::accept-refused - initial SYN has been received, but nothing is listening. Send RST|ACK "reset" segment
tcp:::accept-established - final ACK has been received and the connection is established passively for the listener
The connect-* probes relate to the initiating or active side of the TCP connection, while the accept-* probes are for the listening or passive side of the connection.
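For instance, a quick one-liner (my own, not one of the shipped demo scripts) that counts established outbound connections by remote address and port might look like this:

# dtrace -n 'tcp:::connect-established { @[args[3]->tcps_raddr, args[3]->tcps_rport] = count(); }'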

Connection latency

The tcpconnlat.d script measures the latency incurred for the TCP 3-way handshake to complete for outbound connections, using a quantized aggregation - in other words, connection latencies are grouped into buckets n to 2n, 2n to 4n, etc. Below are the results of running the script in one window while running a number of wgets to various websites in another:

# dtrace -s /usr/demo/dtrace/tcpconnlat.d
dtrace: script '/usr/demo/dtrace/tcpconnlat.d' matched 2 probes
^C

  Connect Latency (ns)                                64.210.72.58                                      
           value  ------------- Distribution ------------- count    
        16777216 |                                         0        
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
        67108864 |                                         0        

  Connect Latency (ns)                                74.125.45.103                                     
           value  ------------- Distribution ------------- count    
        16777216 |                                         0        
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
        67108864 |                                         0        

  Connect Latency (ns)                                192.9.164.55                                      
           value  ------------- Distribution ------------- count    
        33554432 |                                         0        
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
       134217728 |                                         0        

  Connect Latency (ns)                                192.9.164.72                                      
           value  ------------- Distribution ------------- count    
        33554432 |                                         0        
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
       134217728 |                                         0        

The script works by measuring the time between the firing of the tcp:::connect-request probe (which fires when the initial SYN of the three-way handshake is sent) and the firing of tcp:::connect-established (when the final ACK is sent, indicating the connection has been established). The script uses the unique per-connection connection ID (cs_cid) to record the starting timestamp for comparison.

tcp:::connect-request
{
        /* Record the time the initial SYN was sent, keyed by connection ID. */
        start[args[1]->cs_cid] = timestamp;
}

tcp:::connect-established
/start[args[1]->cs_cid]/
{
        /* Handshake complete - aggregate the latency by remote address. */
        @latency["Connect Latency (ns)", args[3]->tcps_raddr] =
            quantize(timestamp - start[args[1]->cs_cid]);
        start[args[1]->cs_cid] = 0;
}

First-byte latency

First-byte latency is the time from connection establishment to when data is first received. The script, tcp1stbyte.d, is geared towards the initiating side of the connection (since it times from tcp:::connect-established to tcp:::receive firing), but could be rewritten to be more general by replacing "tcp:::connect-established {}" with "tcp:::state-change /args[3]->tcps_state == TCP_STATE_ESTABLISHED/ {}" - a sketch of that variant follows the script below.

# dtrace -s /usr/demo/dtrace/tcp1stbyte.d
dtrace: script '/usr/demo/dtrace/tcp1stbyte.d' matched 16 probes
^C

  1st Byte Latency (ns)                               89.207.56.140                                     
           value  ------------- Distribution ------------- count    
        16777216 |                                         0        
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
        67108864 |                                         0        

  1st Byte Latency (ns)                               66.102.9.99                                       
           value  ------------- Distribution ------------- count    
        33554432 |                                         0        
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
       134217728 |                                         0        

Again, the script is straightforward:

tcp:::connect-established
{
        /* Connection established - record the starting timestamp. */
        start[args[1]->cs_cid] = timestamp;
}

tcp:::receive
/start[args[1]->cs_cid] && (args[2]->ip_plength - args[4]->tcp_offset) > 0/
{
        /* First segment carrying payload - aggregate latency by source address. */
        @latency["1st Byte Latency (ns)", args[2]->ip_saddr] =
            quantize(timestamp - start[args[1]->cs_cid]);
        start[args[1]->cs_cid] = 0;
}
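
And here is a minimal sketch (untested) of the more general variant mentioned above, which uses tcp:::state-change so that passively-accepted connections are timed too:

tcp:::state-change
/ args[3]->tcps_state == TCP_STATE_ESTABLISHED /
{
        /* Connection reached ESTABLISHED (actively or passively) - record the time. */
        start[args[1]->cs_cid] = timestamp;
}

tcp:::receive
/start[args[1]->cs_cid] && (args[2]->ip_plength - args[4]->tcp_offset) > 0/
{
        @latency["1st Byte Latency (ns)", args[2]->ip_saddr] =
            quantize(timestamp - start[args[1]->cs_cid]);
        start[args[1]->cs_cid] = 0;
}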

Round-trip latency

Here's a script that isn't included in the demo scripts; even though there are edge cases it probably won't cover, it's a useful demonstration of some additional features of the TCP provider. It measures average round-trip time from the sending of TCP segments to their acknowledgement, on a per-host basis. Here's the output:

# dtrace -s tcp_rtt.d
^C
Remote host               TCP Avg RTT(ns) StdDev          NumBytes   NumSegs   
127.0.0.1                 255635          162214          0          4         
::1                       2718049         3127457         0          4         
66.102.9.99               25552916        27326796        230        12        
89.207.56.140             28347521        57559039        110        28        
74.125.105.87             30258841        0               0          1         

We see that localhost RTTs are smallest for IPv4, then IPv6, then remote hosts. It is interesting that the IPv6 localhost RTT is so much greater. The script itself is a bit more complex:

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
{
	/* Record the send time, keyed by connection ID and expected acknowledgement number. */
	snxt[args[1]->cs_cid, args[3]->tcps_snxt] = timestamp;
	@bytesrtt[args[2]->ip_daddr] =
	    sum(args[2]->ip_plength - args[4]->tcp_offset);
	@countrtt[args[2]->ip_daddr] = count();
}

tcp:::receive
/ snxt[args[1]->cs_cid, args[4]->tcp_ack] /
{
	/* This segment acknowledges a send we timed - aggregate the round-trip time. */
	@meanrtt[args[2]->ip_saddr] =
	    avg(timestamp - snxt[args[1]->cs_cid, args[4]->tcp_ack]);
	@stddevrtt[args[2]->ip_saddr] =
	    stddev(timestamp - snxt[args[1]->cs_cid, args[4]->tcp_ack]);
	snxt[args[1]->cs_cid, args[4]->tcp_ack] = 0;
}

END
{
	printf("%-25s %-15s %-15s %-10s %-10s\n",
	    "Remote host", "TCP Avg RTT(ns)", "StdDev", "NumBytes", "NumSegs");
	printa("%-25s %@-15d %@-15d %@-10d %@-10d\n",
	    @meanrtt, @stddevrtt, @bytesrtt, @countrtt);
}

We use an associative array snxt, which stores timestamps keyed by connection ID and the next sequence number that will be sent. The tcpsinfo value tcps_snxt is the next sequence number TCP will use when sending, so at the tcp:::send probe point it is also the acknowledgement number we expect back for this segment: when a TCP segment is acknowledged, the peer acknowledges it with the next sequence number it expects to receive. So when we receive a segment for that connection carrying that sequence number as its acknowledgement number, we can use the original send timestamp to determine the round-trip acknowledgement time. This is then averaged across all connections for the remote host to get a mean acknowledgement RTT. Note that this scheme will break down for selective acknowledgement (among other cases), but it is a neat demonstration of the kind of measurement that is possible, and represents an easy way to find slow peer systems. Suspiciously high round-trip times would merit further investigation, especially where machines are on the same network segment and thus should have broadly similar latencies. It should be noted that lumping all the ports together (as this script does) leads to high standard deviations for average round-trip times, so a drilldown by port would likely be merited.
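
A sketch of such a per-port drilldown (untested) - keying the RTT aggregation by remote port as well as remote address - might look like:

tcp:::send
{
	snxt[args[1]->cs_cid, args[3]->tcps_snxt] = timestamp;
}

tcp:::receive
/ snxt[args[1]->cs_cid, args[4]->tcp_ack] /
{
	/* The remote port is the source port of the inbound (acknowledging) segment. */
	@meanrtt[args[2]->ip_saddr, args[4]->tcp_sport] =
	    avg(timestamp - snxt[args[1]->cs_cid, args[4]->tcp_ack]);
	snxt[args[1]->cs_cid, args[4]->tcp_ack] = 0;
}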

Sunday Jun 20, 2010

DTrace TCP, UDP providers

We recently integrated DTrace TCP and UDP providers into Solaris/OpenSolaris - they will appear in build 142. I should mention at the outset that Brendan did the lion's share of this work - I just had to merge his changes with the latest Solaris Nevada builds and, for the most part, do the paperwork.

The first thing to note is that there is a great advantage in tracing network events above IP in the networking stack - at that point both inbound and outbound data have been mapped to a connection, which in turn is tied to a process, a zone, etc. This allows us to represent that information (along with an identifier that uniquely identifies that connection for its lifetime) at the various probe points via a csinfo_t (Connection State INFO Type) probe argument:

typedef struct csinfo {
	uintptr_t cs_addr;
	uint64_t cs_cid;     /* connection ID */
	pid_t cs_pid;        /* associated PID */
	zoneid_t cs_zoneid;  /* associated zoneid */
} csinfo_t;

There is great utility in being able to tie TCP and UDP events to a particular process, as illustrated by the tcptop.d script in /usr/demo/dtrace, which is essentially Brendan's DTraceToolkit tcptop using the stable tcp provider. Here's some sample output - we see the top processes as measured by TCP payload bytes (when 0s appear, they signify connections sending control segments like ACKs which have no payload).

# dtrace -s /usr/demo/dtrace/tcptop.d
Sampling... Please wait.
2010 Jun  4 07:56:56,  load: 0.36,  TCPin:      0 Kb,  TCPout:      0 Kb

  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0 102238 127.0.0.1       49668 127.0.0.1        9010         0
     0 102238 ::1             50795 ::1              9010         0

2010 Jun  4 07:57:01,  load: 0.35,  TCPin:      0 Kb,  TCPout:      0 Kb

  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0 102238 127.0.0.1       41433 127.0.0.1        9010         0
     0 102238 ::1             36782 ::1              9010         0
     0 132388 127.0.0.1       53480 127.0.0.1        9010         0
     0 132388 127.0.0.1       62234 127.0.0.1        9010         0
     0 132388 ::1             50266 ::1              9010         0
     0 132388 ::1             63145 ::1              9010         0

2010 Jun  4 07:57:06,  load: 0.33,  TCPin:      0 Kb,  TCPout:      0 Kb

  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0 132388 127.0.0.1       47516 127.0.0.1        9010         0
     0 132388 127.0.0.1       48726 127.0.0.1        9010         0
     0 132388 ::1             57467 ::1              9010         0
     0 132388 ::1             62543 ::1              9010         0

2010 Jun  4 07:57:11,  load: 0.32,  TCPin:      0 Kb,  TCPout:      0 Kb

  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0 132388 129.150.120.230 56142 141.146.118.10    993       247

^C

These events are collected via the tcp:::send and tcp:::receive probes firing, and since we are interested in valid PIDs, we exclude TCP segments which have no associated process (such as RST|ACK segments generated in response to inbound connection requests for ports where no TCP service is listening). Here we collect the details in the @out aggregation:

tcp:::send
/ args[1]->cs_pid != -1 /
{
	/* Sum TCP payload bytes, keyed by zone, PID and the connection's address/port tuple. */
	@out[args[1]->cs_zoneid, args[1]->cs_pid, args[2]->ip_saddr,
	    args[4]->tcp_sport, args[2]->ip_daddr, args[4]->tcp_dport] =
	    sum(args[2]->ip_plength - args[4]->tcp_offset);
}
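
If all you want is a per-zone breakdown rather than the full tcptop.d output, a one-liner in the same vein (a sketch) will do:

# dtrace -n 'tcp:::send / args[1]->cs_pid != -1 /
    { @bytes[args[1]->cs_zoneid] = sum(args[2]->ip_plength - args[4]->tcp_offset); }'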

As a quick way to see which processes/zones are busy, tcptop.d is pretty useful. To drill down to individual segments, tcpsnoop.d can be used:

# dtrace -s /usr/demo/dtrace/tcpsnoop.d
  TIME          PID           LADDR:PORT                 RADDR:PORT   BYTES FLAGS
735575877102 102238       127.0.0.1:45518  ->        127.0.0.1:9010       0 (SYN)
735575877139     -1       127.0.0.1:9010   <-        127.0.0.1:45518      0 (SYN)
735575877168     -1       127.0.0.1:9010   ->        127.0.0.1:45518      0 (RST|ACK)
735575877230 102238       127.0.0.1:45518  <-        127.0.0.1:9010       0 (RST|ACK)
735581534280 132408 129.150.120.230:38737  ->   216.129.112.94:80       712 (PUSH|ACK)
735581725757 132408 129.150.120.230:38737  <-   216.129.112.94:80         0 (ACK)
735581735388 132408 129.150.120.230:38737  <-   216.129.112.94:80       356 (PUSH|ACK)
735581850134 132408 129.150.120.230:38737  ->   216.129.112.94:80         0 (ACK)
^C

As mentioned above, note how RST|ACKs sent specify a PID of -1 (NOPID) - this indicates that they are generated by the kernel protocol stack and have no associated process. Notice also that they are sent and received locally, so they appear twice, as does all localhost traffic.

Similar udptop.d and udpsnoop.d demo scripts are also available in /usr/demo/dtrace.
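
The UDP probe arguments mirror the TCP ones, so similar one-liners work there too; for example, this sketch counts datagrams sent per process ID and destination address:

# dtrace -n 'udp:::send { @dgrams[args[1]->cs_pid, args[2]->ip_daddr] = count(); }'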

From the above, I learned a general lesson about DTrace providers - while it is clearly of great benefit to add a stable provider to allow probing into a subsystem, it is of even greater benefit if that provider can tie the events it probes to other system abstractions (in this case, we link network protocol processing to user processes and system zones).

The other thing to emphasize is that the above DTrace probes are not intended as replacements for snoop or Wireshark - those tools do packet sniffing on the wire (often in promiscuous mode), whereas the above scripts monitor traffic that reaches IP and is fanned out to TCP or UDP connections on the inbound path, or has not yet reached IP on the outbound path. So, in other words, a snoop trace could look quite different. The primary goal of this work (at least to me) is not to replace those packet-sniffing tools, but rather to augment the kinds of analysis we can do - in particular, we can view TCP and UDP processing events in a wider system context (Which process did this traffic originate from? Which zone is receiving the most TCP traffic? Which ports are busy?).

In the next entry, I'm going to talk about how we can carry out TCP latency measurements using the tcp provider.

Wednesday Sep 24, 2008

NWAM model (and phase 0.5)

The Network Auto-Magic project, which aims to simplify and automate network configuration, needs to do a few things:

  • catch, handle (and sometimes generate) network-related events, handling transitioning from one network environment to another etc.
  • provide a data repository for configuration preferences for various network environments, with a set of default preferences that lean towards automatic configuration (e.g. use DHCP to get an address)
  • provide a set of user interfaces to allow for manual configuration as well as inspection of network state

This high-level description maps quite nicely onto the set of core components we plan to deliver for phase 1 of NWAM:

  • nwamd - the core daemon at the heart of network autoconfiguration. It is this daemon's job to handle network-related events, assess network conditions and respond to changes in conditions
  • libnwam - the library that provides storage of NWAM-related configuration
  • Finally, command-line tools such as nwamcfg and nwamadm, along with the GUI component of NWAM, allow user interaction with network configuration

In a follow-up set of blog posts, I'm going to try and describe these components, starting with the set of events that nwamd needs to monitor and respond to and how that is (and will be) done.

In the meantime, why not give NWAM phase 0.5 a try - it alleviates many of the usability issues that applied to NWAM and works really nicely. Kudos to Jim and Darren (and Michael for his work on spec'ing out a set of candidate solutions to usability issues).

Monday Sep 22, 2008

OpenSolaris/Windows Vista dual boot configuration

I recently had to make a Windows Vista-installed laptop dual-bootable with OpenSolaris. There are plenty of descriptions of the process around, but just in case, here's what I did:

  • Back up any critical data in Windows
  • Burn OpenSolaris 08-05 liveCD
  • Burn a system rescue CD with GParted (disk partition management software). It's a Linux-based liveCD that allows you to resize partitions dynamically, add new partitions, etc. We need to reduce the size of the Windows partition on the disk to make room for a partition for OpenSolaris.
  • Reboot with the system rescue CD in the drive. When it boots up, type "startx", and when X is running, select the GParted icon to run the partition management software. Click on the Vista partition, resize it by clicking and dragging the arrow on the right-hand side to the left, and when you've left enough space for OpenSolaris, also add a new partition of type "linux-swap". Resizing/adding the partition should take about an hour. When finished, open a terminal and type reboot. A video walking through this process can be found here.
  • Boot into Windows, allow it to run disk repair, then put the OpenSolaris liveCD in the drive and reboot into OpenSolaris.
  • Click on the install icon on the desktop. When you get to the disk partition information, you'll need to select the new partition you just added and change the partition type from linux-swap to solaris before you can install.
  • When you reboot (and have removed the liveCD from the drive), you should see Windows and OpenSolaris in your GRUB menu. Then you can boot into OpenSolaris, run "pkg image-update", etc.

Tuesday Jul 29, 2008

Using Mercurial during the development process

With guidance from Anurag, I've been trying to figure out how the NWAM development process changes with the advent of Mercurial as the source code management tool for OpenSolaris. I've tried to detail the process I'm using here in case it's of use to others. The basic idea is to have a development repository in which we make changes (and will eventually push to the main NWAM repository on opensolaris.org), along with a build repository, to which we pull our changes to build/test. Here are the steps:

  1. clone a development repository
    # hg clone ssh://anon@hg.opensolaris.org/hg/nwam/nwam1 /path2/dev_ws
    
  2. make changes by editing files
  3. commit these changes in the development repository
    # hg commit
    
  4. clone/pull the above development changes to a build/test repository
    # hg clone /path2/dev_ws /path2/build_ws
    
    To ensure the latest changes are there:
    # cd /path2/build_ws
    # hg update
    
    Or if, rather than cloning, you need to update an existing build repository:
    # cd /path2/build_ws
    # hg pull -u /path2/dev_ws
    
  5. clone usr/closed separately, adding it to the build repository (external-to-Sun builds need to download the closed binaries at this point instead, I believe)
    # cd /path2/build_ws/usr
    # hg clone ssh://elpaso.sfbay//export/clone-hg/usr/closed
    
  6. build/test
  7. push dev changes, replacing yourname with your opensolaris.org account name. SSH keys need to be set up with your OpenSolaris account for this to work properly, and your account needs to be on the list of contributors.
    # cd /path2/dev_ws
    # hg push ssh://yourname@hg.opensolaris.org/hg/nwam/nwam1

Friday Jul 25, 2008

SMF snapshots, or why didn't my property change show up in svcprop?

One area of SMF that is a bit confusing is the whole notion of snapshots, refresh, and how property updates occur. I find this a bit tricky myself, so if there are mistakes below, I'll happily correct them.

A snapshot is essentially a read-only picture of a set of property groups associated with an instance and its parent service. The most important snapshot is the 'running' snapshot, as it is the place applications (including svcprop) and service methods should read their configuration from. Why is this?

My understanding is that the snapshot model is there (in part at least) to allow atomic commit of multiple changes. This is particularly important for services that consume such configuration information. Imagine I want to change two properties in an inetd service: I'd like to change a tcp service from a wait-type service (only handling one connection at a time) to a nowait-type service with a specific maximum connection rate. I need to change two properties - the 'wait' property and the 'max_con_rate' property - since a nonzero value for the latter only makes sense for nowait services. The snapshot model allows me to change each individually with svccfg, then refresh. The refresh action updates the running snapshot to contain a read-only picture of the latest values of my properties (it also runs the refresh method associated with the service instance, if there is one). Then entities consuming these values will see both changes at once as the running snapshot is updated atomically.
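
As a concrete sketch, that two-property change might look something like the following (the telnet instance is just an example - substitute the inetd-managed service you care about):

# svccfg -s svc:/network/telnet:default setprop inetd/wait = boolean: false
# svccfg -s svc:/network/telnet:default setprop inetd/max_con_rate = integer: 50
# svcadm refresh svc:/network/telnet:default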

Sometimes you'll see the actual service and instance property values referred to as the 'configuration' or 'editing' snapshot, but that's not quite accurate, since as we saw above, snapshots are read-only and we make our changes to the actual configuration values, not a frozen picture of them.

If you select an instance using svccfg, you can see there's a bunch of snapshots in addition to 'running':

# svccfg -s nwam
svc:/network/physical:nwam> listsnap
initial
last-import
previous
running
start
svc:/network/physical:nwam> 

Here we see another aspect of snapshots. svccfg(1M) allows us to revert back to earlier snapshots. The names of the various snapshots are pretty self-explanatory. The idea is, if I mess up my configuration, I just need to revert to the initial snapshot, or the one associated with last manifest import, etc.
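
For example, reverting to the state captured at the last manifest import might look something like this (followed by a refresh so the running snapshot - and hence consumers - pick up the reverted values):

# svccfg -s nwam revert last-import
# svcadm refresh nwam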

You'll note something that seems weird - 'refresh' can be carried out both by svcadm and svccfg. Why is this? Well, as we saw earlier, refresh has a configuration component - update the running snapshot - and an action component - run the refresh method. Configuration actions belong in svccfg, and actions in svcadm, so it makes sense to have refresh in both. Also, it's possible you may need to update the running snapshot in an SMF repository that's not currently active - maybe you made a configuration change that makes the system unbootable, and it's in the running snapshot, so you can't fix it with svcadm. Having refresh in svccfg makes such problems fixable, since it can operate on an inactive SMF repository. More on this another time.

So finally, why didn't my svccfg property change show up in svcprop? Because svcprop reads from the running snapshot, and until refresh is run, the changes haven't hit that snapshot yet.
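
To make that concrete, a session along the following lines shows the behaviour (the nwamd/debug property here is purely illustrative - any property change behaves the same way):

# svccfg -s nwam setprop nwamd/debug = boolean: true
# svcprop -p nwamd/debug nwam
false
# svcadm refresh nwam
# svcprop -p nwamd/debug nwam
true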
