Thursday, March 31, 2011

To get network statistics

Sometimes we get questions from customers about how to retrieve basic network statistics (such as the data shown by `netstat -sP tcp`) programmatically. The correct way is to use SNMP! But sometimes they just want to get a simple counter, say the number of currently established TCP connections (tcpCurrEstab). "Is there a quick and dirty way?" they ask. Yes, there is, but use it at your own risk :-)

The global network MIB2 statistics are currently kept using kstat(3KSTAT). For example, `kstat -m udp -c mib2` prints out the following UDP MIB2 counters.

module: udp                             instance: 0     
name:   udp                             class:    mib2
        crtime                          1052755.8976215
        entry6Size                      64
        entrySize                       36
        inDatagrams                     57470
        inErrors                        0
        outDatagrams                    63889
        outErrors                       0
        snaptime                        1882746.61598094
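
(For a quick check from the shell, the kstat(1M) command can also pull out a single counter, e.g. `kstat -m tcp -c mib2 -s currEstab` for tcpCurrEstab. The rest of this post is about doing the same thing from a program.)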

So one can write a small program using kstat(3KSTAT) calls to get the standard network MIB2 counters. As an example, here are the simple steps to get the SCTP counters.

1. Use kstat_open(3KSTAT) to get a handle.

2. Use kstat_lookup(3KSTAT) (module is "sctp", instance is 0, name is "sctp") to get the SCTP kstat_t *.

3. Use kstat_read(3KSTAT) to read in the SCTP kstat_t *.

4a. Use kstat_data_lookup(3KSTAT) to get an individual field, say "sctpInClosed". The SCTP kstat_t uses the KSTAT_TYPE_NAMED type, so the return value of kstat_data_lookup() is a kstat_named_t *. Use the data_type field in kstat_named_t to find out whether the value is a 32 bit or 64 bit integer.

4b. Or use the SCTP kstat_t to get all the fields. The ks_data field in kstat_t is an array of kstat_named_t in this case.

5. Call kstat_close(3KSTAT) to close the handle.

Check all the related man pages for further details on the usage.
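
The same steps work for the other MIB2 modules. As a minimal sketch (my own example, not taken from any official documentation), the following program reads the tcpCurrEstab counter mentioned at the top. It assumes the counter is kept under the name "currEstab" in the "tcp" MIB2 kstat, since the kstat field names drop the module prefix, as in the UDP output above. The program links against libkstat.

/*
 * A minimal sketch, not production code: read tcpCurrEstab via kstat.
 * Assumes the counter name is "currEstab" (module prefix dropped).
 * Compile with:  cc this_file.c -lkstat
 */
#include <stdio.h>
#include <kstat.h>

int
main(void)
{
	kstat_ctl_t *kc;
	kstat_t *ksp;
	kstat_named_t *knp;

	/* Step 1: get a kstat handle. */
	if ((kc = kstat_open()) == NULL) {
		perror("kstat_open");
		return (1);
	}

	/* Step 2: look up the TCP MIB2 kstat (module "tcp", instance 0). */
	if ((ksp = kstat_lookup(kc, "tcp", 0, "tcp")) == NULL) {
		perror("kstat_lookup");
		(void) kstat_close(kc);
		return (1);
	}

	/* Step 3: read in the kstat data. */
	if (kstat_read(kc, ksp, NULL) == -1) {
		perror("kstat_read");
		(void) kstat_close(kc);
		return (1);
	}

	/* Step 4a: look up the individual named counter. */
	if ((knp = kstat_data_lookup(ksp, "currEstab")) == NULL) {
		perror("kstat_data_lookup");
		(void) kstat_close(kc);
		return (1);
	}

	/* The counter can be a 32 bit or 64 bit integer; check data_type. */
	if (knp->data_type == KSTAT_DATA_UINT32)
		(void) printf("tcpCurrEstab = %u\n", knp->value.ui32);
	else if (knp->data_type == KSTAT_DATA_UINT64)
		(void) printf("tcpCurrEstab = %llu\n",
		    (unsigned long long)knp->value.ui64);

	/* Step 5: close the handle. */
	(void) kstat_close(kc);
	return (0);
}

For step 4b, instead of looking up one field, one can walk the whole array: ks_ndata in the kstat_t gives the number of entries and KSTAT_NAMED_PTR(ksp) gives the kstat_named_t array (again, assuming the standard <kstat.h> definitions).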

Wednesday, June 15, 2005

Solaris TCP Window Update

When people check out the TCP source code in OpenSolaris, they may find that some pieces of the code do not follow exactly what is specified in the various RFCs.  Here is an example and the reason why Solaris deviates from the RFCs.

On page 72 of RFC 793, the criterion for updating the TCP send window is specified as follows.

    If SND.UNA < SEG.ACK =< SND.NXT, the send window should be
    updated.  If (SND.WL1 < SEG.SEQ or (SND.WL1 = SEG.SEQ and
    SND.WL2 =< SEG.ACK)), set SND.WND <- SEG.WND, set
    SND.WL1 <- SEG.SEQ, and set SND.WL2 <- SEG.ACK.

    Note that SND.WND is an offset from SND.UNA, that SND.WL1
    records the sequence number of the last segment used to update
    SND.WND, and that SND.WL2 records the acknowledgment number of
    the last segment used to update SND.WND.  The check here
    prevents using old segments to update the window.

And on page 94 of RFC 1122, the first condition above is corrected to

    Similarly, the window should be updated if: SND.UNA =<
    SEG.ACK =< SND.NXT.

In Solaris, we use a different check.  See the following piece of code in usr/src/uts/common/inet/tcp/tcp.c:

swnd_update:
	/*
	 * The following check is different from most other implementations.
	 * For bi-directional transfer, when segments are dropped, the
	 * "normal" check will not accept a window update in those
	 * retransmitted segments.  Failing to do that, TCP may send out
	 * segments which are outside receiver's window.  As TCP accepts
	 * the ack in those retransmitted segments, if the window update in
	 * the same segment is not accepted, TCP will incorrectly calculate
	 * that it can send more segments.  This can create a deadlock
	 * with the receiver if its window becomes zero.
	 */
	if (SEQ_LT(tcp->tcp_swl2, seg_ack) ||
	    SEQ_LT(tcp->tcp_swl1, seg_seq) ||
	    (tcp->tcp_swl1 == seg_seq && new_swnd > tcp->tcp_swnd)) {
		/*
		 * The criteria for update is:
		 * 1. the segment acknowledges some data.  Or
		 * 2. the segment is new, i.e. it has a higher seq num.  Or
		 * 3. the segment is not old and the advertised window is
		 * larger than the previous advertised window.
		 */

The check

    SND.WL1 = SEG.SEQ and SND.WL2 =< SEG.ACK

is modified to be

    SND.WL2 < SEG.ACK

Without the change of conditions, a combination of zero window and segment drop can cause a deadlock in TCP.  The reason is that according to the RFCs, TCP does not use the window update in out of order segments (retransmitted segments because of drops are out of order), yet the ACK field in those segments is processed.  This can cause a sender A to send beyond the other side's (B's) receive window.  This is because the ACK field moves the left edge of the window forward, but as the window update (being 0) in the same segment is not used, TCP will continue to use the old send window, which is bigger.  Thus from A's perspective, the whole send window moves forward.  Those out of window segments will be dropped by B.  And once A sends beyond B's receive window, all ACKs from A to B will also be dropped by B because they are out of window (TCP uses the latest sequence number in ACK segments).  In a bi-directional transfer, this means that B will keep on retransmitting its data as those ACKs from A are not acceptable.  The connection will hang.  Note that this is not a problem in uni-directional transfer.

If a segment (even an out of order one) passes the normal TCP acceptance test and the ACK field acknowledges new data, the window update in the segment must also be used.  The window update and the ACK field are really tied together.  One cannot use the ACK field without also using the window update.  This issue was discussed in the now closed tcp-impl mailing list several years ago.  But AFAIK, there is no write-up on this issue, so there may still be implementations which have this problem handling bi-directional transfers.
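
To make the difference concrete, here is a small self-contained sketch (my own illustration for this post, not code taken from OpenSolaris; the macro definitions, structure and numbers are made up) that re-implements both checks and runs them against a retransmitted segment which acknowledges new data but advertises a zero window.

#include <stdio.h>
#include <stdint.h>

/* Wrap-around sequence number comparisons. */
#define	SEQ_LT(a, b)	((int32_t)((a) - (b)) < 0)
#define	SEQ_LEQ(a, b)	((int32_t)((a) - (b)) <= 0)

typedef struct {
	uint32_t swl1;	/* SND.WL1: seq num of last window update segment */
	uint32_t swl2;	/* SND.WL2: ack num of last window update segment */
	uint32_t swnd;	/* SND.WND: current send window */
} snd_state_t;

/* RFC 793 (as corrected by RFC 1122): only "new" segments update the window. */
static int
rfc_accepts_update(const snd_state_t *s, uint32_t seg_seq, uint32_t seg_ack)
{
	return (SEQ_LT(s->swl1, seg_seq) ||
	    (s->swl1 == seg_seq && SEQ_LEQ(s->swl2, seg_ack)));
}

/* Solaris: also accept the update when the ACK field acknowledges new data. */
static int
solaris_accepts_update(const snd_state_t *s, uint32_t seg_seq,
    uint32_t seg_ack, uint32_t new_swnd)
{
	return (SEQ_LT(s->swl2, seg_ack) ||
	    SEQ_LT(s->swl1, seg_seq) ||
	    (s->swl1 == seg_seq && new_swnd > s->swnd));
}

int
main(void)
{
	/* Last window update came from a segment with seq 2000, ack 1000. */
	snd_state_t s = { 2000, 1000, 8192 };
	uint32_t seg_seq = 1500;	/* a retransmitted (old) segment ... */
	uint32_t seg_ack = 1500;	/* ... which acknowledges new data ... */
	uint32_t new_swnd = 0;		/* ... and advertises a zero window */

	(void) printf("RFC check accepts update:     %d\n",
	    rfc_accepts_update(&s, seg_seq, seg_ack));
	(void) printf("Solaris check accepts update: %d\n",
	    solaris_accepts_update(&s, seg_seq, seg_ack, new_swnd));
	return (0);
}

The RFC-style check rejects the update because the segment's sequence number is old, while the Solaris check accepts it because the ACK field advances.  The zero window therefore takes effect together with the acknowledgment, which is exactly what avoids the deadlock described above.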



Tuesday, June 14, 2005

Magic ndd(1M) tunables

We often get requests from customers asking about the meaning of some IP/TCP/UDP/... ndd(1M) parameters and when to change them.  One common reason is that some of those parameters are thought to be secret magic knobs for improving network performance.  But in reality, nearly all of those parameters are not supposed to be changed at all.  They are there just in case of abnormal situations.  Now that OpenSolaris is available, the truth about those parameters is finally revealed!  People can just look at the code and comments!  I'll describe one TCP ndd(1M) parameter, tcp_use_smss_as_mss_opt, added in Solaris 10 as an example to illustrate this.

You can find the following piece of code in the usr/src/uts/common/inet/tcp/tcp.c file (you need to scroll down a little to find it).

		case TCPS_SYN_RCVD:
			flags |= TH_SYN;

			/*
			 * Reset the MSS option value to be SMSS
			 * We should probably add back the bytes
			 * for timestamp option and IPsec.  We
			 * don't do that as this is a workaround
			 * for broken middle boxes/end hosts, it
			 * is better for us to be more cautious.
			 * They may not take these things into
			 * account in their SMSS calculation. Thus
			 * the peer's calculated SMSS may be smaller
			 * than what it can be. This should be OK.
			 */
			if (tcp_use_smss_as_mss_opt) {
				u1 = tcp->tcp_mss;
				U16_TO_BE16(u1, wptr);
			}

The code above is executed when a TCP end point is in the SYN-RECEIVED state (defined as TCPS_SYN_RCVD in the code) and is composing the SYN/ACK segment in response to an incoming SYN segment.  The variable wptr in the above code points to where the TCP MSS (Maximum Segment Size) option value in the SYN/ACK goes.  So if the ndd parameter tcp_use_smss_as_mss_opt is set to a non-zero value, the TCP MSS option value will be set to tcp->tcp_mss.  The field tcp_mss in the tcp_t structure is the actual sending MSS size, which is calculated using the TCP MSS option value in the received SYN segment, the length of additional TCP options in each segment, the outgoing network interface's MTU size, and possibly the IPsec header overhead.  The default value of tcp_use_smss_as_mss_opt is 0.  So why would one want to use the sending MSS size as the advertised TCP MSS option value?  The MSS option value is supposed to mean the maximum segment size the local TCP end point can receive, not send.

The reason is briefly described in the comments above the code.  We introduced this parameter to get around some broken middle boxes.  Without this parameter, the local TCP end point uses the outgoing network interface's MTU size to calculate the TCP MSS option value, since the largest segment the local TCP end point can receive is determined by the MTU size of that network interface.  If the network interface is a normal Ethernet card, the MSS option value is 1460.  Note that this value is independent of the MSS option value advertised by the other side of the connection in its SYN segment.  The other side should also use the same method to calculate its TCP MSS option value.  Both sides of a connection then calculate the correct send MSS size from the advertised values.  Everything should work as expected...

The problem comes when there is a broken middle box, such as a DSL modem/router using PPPoE.  Suppose machine A has a normal Ethernet interface but is connected to the Internet using DSL with PPPoE.  A's TCP stack may not know about the reduced MTU size because of PPPoE.  This is usually not a problem as path MTU discovery can handle the issue.  But if the DSL modem/router is broken and does funny things, we get into trouble.  One funny thing is that it may modify the MSS option value A sends out.  The value A sends out should be 1460, but the modem/router can reset it to a lower value based on the PPPoE overhead.  By doing this, the modem/router thinks that it helps solve the path MTU issue.  Thus it thinks it can forget about path MTU discovery and not send the ICMP messages required for path MTU discovery to work.
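
As a rough illustration with typical numbers: a plain Ethernet MTU of 1500 bytes gives an advertised MSS of 1500 - 20 (IP header) - 20 (TCP header) = 1460 bytes, while PPPoE encapsulation takes another 8 bytes, leaving an MTU of 1492 and hence an MSS of 1452.  So a "helpful" modem/router would typically rewrite the 1460 in A's MSS option down to something like 1452.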

Suppose A is trying to talk to machine B, which also uses an Ethernet interface.  The modem/router changes A's TCP MSS option value to X, but it does not change the MSS option value (which should be 1460) in B's SYN/ACK.  While B will not send a segment larger than X bytes to A, A can send a full 1460-byte segment to B.  These full 1460-byte segments to B will be dropped by the modem/router.  And since the modem/router does not participate in path MTU discovery, A's TCP stack will never know about the problem and the connection will just hang.

We have some customers experiencing exactly this problem with clients who are behind such broken middle boxes.  While those clients can connect to our customers' servers, they cannot do any transactions because the data sent to those servers is dropped by the middle boxes.  One workaround is to lower the MTU size on our customers' servers.  The calculated TCP MSS option value will then also be smaller.  This is not optimal, as not all of their clients are behind such broken middle boxes.  For clients not behind such middle boxes, this workaround reduces the network performance to our customers' servers, since they can no longer send full-size segments.

We introduced the tcp_use_smss_as_mss_opt parameter to work around this problem.  In the above case, if the tunable is set to 1, Solaris TCP (as B) will use X as the TCP MSS option value.  Then A will only send segments of at most X bytes to B.  And clients not behind such broken middle boxes can still send full-size segments.
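
(For reference, using the standard ndd(1M) syntax rather than anything specific to this parameter: the tunable can be turned on with `ndd -set /dev/tcp tcp_use_smss_as_mss_opt 1` and its current value read back with `ndd /dev/tcp tcp_use_smss_as_mss_opt`.)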

If there were no such broken middle boxes, there would really be no need for the tcp_use_smss_as_mss_opt parameter...  There are other ndd(1M) parameters which were introduced for similarly unusual circumstances.  They are not secret magic knobs.



