OUTLINE (as discussed at IETF 4/89)
x's for text included below

Connection State, Establishment & Closing
     Connection Establishment  Y
   x Maximum Segment Size  M
     TOS  N
     Precedence  M
   x Management of TCBs  Y
   x Connection Reuse  Y

Send Side
     Slow-start Congestion Avoidance  Y
     Sender Silly Window Avoidance  Y
     Nagle Small Packet Avoidance  Y
     Relationship of Nagle & S-S  Y
   x PUSH  Y
   x RTT Estimation & RTO Calculation  Y
     Rxt Amount  Y
   x Rxt Backoff  Y
     Response to Source Quench  Y
     Zero-window & SWS Cant-Send State  M

Receive Side
   x Out-of-order Processing  Y
     Window Upper Limits  M
     Receiver Silly Window Avoidance  Y
   x Delayed ACK  Y
     Piggy-back ACKs  M

Interfaces and Other
     Application-TCP Interface  Y
     Fairness Among Connections  Y
     TCP-IP Interface  M
     Fair Processor  Y
     Off-board Issue  M
     Connection Instrumentation  Y
     Extended Window Option  Y
     High-Speed Implementation Techniques  M

TEXT (as of 5/10/89)

MAXIMUM SEGMENT SIZE

During a data transfer, the length of TCP segments has a decided effect on performance. The TCP connection parameter that determines segment length is "MSS," for "Maximum Segment Size." Poor performance can result from an MSS that is either too large or too small.

A segment is a unit of data that is transferred between TCP modules. In other words, it is the data portion of a TCP PDU. It is segments that are acknowledged or retransmitted by TCP modules.

The motivation for using large segments is to minimize the relative amount of header overhead and packet processing during a data transfer. Up to a point, the larger the segments, the fewer IP datagrams a data transfer will require. Clearly, transmission bandwidth and memory are saved if fewer headers are required. Furthermore, the processing needed for a data transfer is roughly proportional to the number of packets, regardless of packet length.

The efficiency gains from large MSS values end, however, as soon as segments are so large that IP must fragment them. If an IP datagram exceeds the MTU length of a network it must traverse -- if it is too long for a network interface -- then the IP module will fragment the datagram. Each fragment contains an IP header, along with an offset and a flag to indicate whether a particular fragment is the last. Further along, if they must traverse networks with even smaller MTUs, the fragments themselves may undergo fragmentation. Fragments are reassembled by the IP module at their destination. Hosts running IP must have the ability to accept, or reassemble from fragments, datagrams of at least 576 bytes [Host Requirements RFC, Section 3.3.3].

Because fragmentation results in the generation of multiple IP headers and causes additional processing, it immediately wipes out the advantages of using a large MSS. The most detrimental effect on performance, however, occurs at the destination. Fragments are not delivered piecemeal: IP must successfully reassemble an entire datagram before delivering it to an upper-layer protocol. If a single fragment is lost, then the remaining datagram fragments will eventually be dropped, and all resources that were used for their transmission are wasted. The ill effects of IP fragmentation are compounded by the fact that some gateway implementations simply cannot perform correct datagram fragmentation. For a comprehensive analysis of the effect of IP fragmentation on performance, see the article by Kent and Mogul, "Fragmentation Considered Harmful" [SIGCOMM citation].
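The tradeoff described above can be made concrete with a small back-of-the-envelope calculation. The following C fragment is only an illustrative sketch: the constants and function names are invented for this example, fragment sizes are approximated without the 8-octet offset alignment rule, and 20-byte IP and TCP headers with no options are assumed. It estimates how many packets and header octets a transfer costs for several MSS choices over a path with a 1500-octet (Ethernet) MTU.

    /* Illustrative sketch only: estimate how many packets and header
     * octets a transfer of "nbytes" costs for a given MSS and path MTU,
     * assuming 20-byte IP and TCP headers and no options.  Names and
     * structure are invented for this example.
     */
    #include <stdio.h>

    #define IP_HDR  20
    #define TCP_HDR 20

    static long ceil_div(long a, long b) { return (a + b - 1) / b; }

    static void transfer_cost(long nbytes, long mss, long path_mtu)
    {
        long segs, dgram, frags_per_dgram, packets, hdr_bytes;

        segs = ceil_div(nbytes, mss);            /* TCP segments sent       */
        dgram = mss + TCP_HDR + IP_HDR;          /* resulting IP datagram   */
        frags_per_dgram = 1;
        if (dgram > path_mtu)                    /* IP must fragment        */
            frags_per_dgram = ceil_div(dgram - IP_HDR, path_mtu - IP_HDR);

        packets = segs * frags_per_dgram;
        hdr_bytes = segs * TCP_HDR + packets * IP_HDR;

        printf("MSS %5ld: %5ld segments, %5ld packets, %6ld header octets\n",
               mss, segs, packets, hdr_bytes);
    }

    int main(void)
    {
        long n = 100000;                /* a 100,000-octet transfer        */
        transfer_cost(n, 512,  1500);   /* conservative MSS                */
        transfer_cost(n, 1460, 1500);   /* fills the Ethernet MTU exactly  */
        transfer_cost(n, 4096, 1500);   /* too large: every datagram is    */
                                        /* fragmented en route             */
        return 0;
    }

The overly large MSS in the last case produces fewer TCP segments but more packets and more header octets than the MTU-sized MSS, which is the effect the text above describes.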
Given the above considerations, the optimum MSS for a connection will be the maximum MSS that does not cause fragmentation.* This value is a function of the source, destination, and type of service used by the connection. As a result of internet connectivity changes, the optimum MSS value may change over time.

The Host Requirements RFC requires that hosts implement the MSS option, and recommends that an implementation use it during connection establishment if its receive MSS differs from 536 (Section 4.2.2.6). The value of the MSS used during data transfers is then computed from the formula:

   Eff.snd.MSS = min(SendMSS + 20, mts) - TCPhdrsize - IPoptionsize

where "SendMSS" is the MSS received from the remote host during connection establishment (assumed to be 536 if the remote peer does not use the MSS option), and "mts" is the maximum length of a transport message that the local IP will send. The value for mts is obtained from the local IP service interface.

An earlier approach to setting MSS was to use large values only if there was special knowledge that no fragmentation would result (e.g., if the destination were on the same Ethernet), but otherwise to use an MSS of 536. The algorithm given in the Host Requirements RFC is based on the tenet that MSS discovery is best handled by IP.

____________________
*It is a common practice to round the MSS down to a convenient block size. For example, an MSS of 536 is frequently approximated by 512.

CONNECTION BLOCK MANAGEMENT

The performance of a TCP implementation may be affected by the methods used to manage the connection state blocks, also known as Transmission Control Blocks or TCBs, associated with each TCP connection. This will be especially true on systems which have many connections in progress simultaneously.

When a TCP segment arrives, it is necessary to associate it with some TCB in order to decide what to do next. A TCP connection is uniquely identified by the 4-tuple {local address, local port, foreign address, foreign port} as described in RFC 793, section 2.7. For active connections, there will be exactly one TCB with matching values. Thus, one part of the processing of every incoming segment is to extract the address and port information and attempt to locate the corresponding TCB. The techniques used to perform this matching operation should be carefully chosen for efficiency, particularly for the most common case. Special considerations may be required to deal with TCBs which exist in an inactive state. Passive OPENs are one such case. These issues will be discussed separately later.

The simplest way of organizing the TCBs is as an unordered linear array or linked list. For small-system implementations of TCP, this may be the best approach. For a TCP which is supporting numerous connections, the time required to search through an unordered list may have a noticeable impact on processing overhead, and perhaps even on attainable throughput. Also, this approach tends to make it unwieldy to support multiple, possibly overlapping partially-specified passive OPENs in a useful way.

An alternative approach is to maintain a hash table with pointers to multiple linked lists. This might be a good strategy for moderate numbers of connections. It does raise the question of what key value and hash function are desirable. One possibility is to use the hash function:

   H = LP mod N

where LP is the local TCP port number, N is the size of the hash table, and H is the derived index into the hash table.
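As a concrete illustration of this scheme, the following C sketch looks up a TCB by hashing on the local port and then walking the chain for a fully-specified match. It is only a sketch under the assumptions of this section; the structure, field, and function names are invented here and do not correspond to any particular implementation.

    /* Minimal sketch of a hashed TCB lookup using H = LP mod N.
     * All names are invented for illustration.
     */
    #include <stddef.h>

    #define TCB_HASH_SIZE 8                  /* N */

    struct tcb {
        unsigned long  local_addr, foreign_addr;
        unsigned short local_port, foreign_port;
        struct tcb    *next;                 /* chain of TCBs with the same hash */
        /* ... remaining connection state ... */
    };

    static struct tcb *tcb_hash[TCB_HASH_SIZE];

    static struct tcb *
    tcb_lookup(unsigned long laddr, unsigned short lport,
               unsigned long faddr, unsigned short fport)
    {
        struct tcb *t;

        for (t = tcb_hash[lport % TCB_HASH_SIZE]; t != NULL; t = t->next) {
            if (t->local_port == lport && t->local_addr == laddr &&
                t->foreign_port == fport && t->foreign_addr == faddr)
                return t;                    /* fully-specified (active) match */
        }
        return NULL;                         /* caller may then consider any
                                                partially-specified passive OPENs */
    }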
N could be chosen as a power of 2 to allow the hash function to be implemented with a bitwise masking instruction. (Note that 8 is the smallest power of 2 which will provide a unique hash for the assigned well known ports for TELNET, FTP, SMTP, and DOMAIN services. This is by chance, not by design.)

It would seem preferable to use something more than the local port number as input to the hash function, especially if many connections are expected to be made to the same local port (e.g., a TELNET terminal server). However, the following facts should be considered:

(1) Hashing on the local address seems relatively unprofitable, since the majority of systems are not multiply homed and thus have only one local address.

(2) Hashing on the local address, foreign address, or foreign port may lead to complications in handling TCBs created by partially-specified passive OPENs. Since they have not yet been bound to a foreign address and port, several TCBs may potentially match an incoming SYN segment. An implementation should be structured to find the closest match (the one with the fewest unspecified fields). If all partially-specified TCBs for a given local port number are on a single linked list, this structuring can be achieved by inserting new TCBs at the correct position in the list.

Another way to speed up this matching operation is to arrange the list(s) of TCBs to cause the most frequently referenced TCBs to be checked first. Again, many variations on this strategy are possible. The approach taken in the Berkeley networking code is to install each new active connection at the head of the list. Since the Berkeley code uses the template/clone approach to implementing passive OPENs, the template passive OPENs for common servers are typically created shortly after the system is booted, and thus their TCBs are at the end of the list. When one of these TCBs is activated in response to an incoming SYN, a clone TCB is created and placed at the head of the list.

TIMEWAIT AND CONNECTION REUSE

In order to protect against old segments being interpreted as part of the traffic stream of a new connection, the TCP protocol imposes certain rules and procedures for segment handling and interpretation. The key elements of this mechanism are (1) a bound on maximum segment lifetime, and (2) the "quiet time" concept. Associated with (1) is the requirement for appropriately initializing, decrementing, and testing the Time-to-live (TTL) field of the IP datagram header. When both the TCP and IP specifications are strictly followed, the TTL serves to bound the lifetime of a TCP segment traveling through the internet. RFC 793 specifies MSL, the maximum segment lifetime, as two minutes; this is noted as being a somewhat arbitrary choice.

The other half of this integrity guard consists of each TCP's ensuring that it does not reuse the connection identifying information (source and destination IP addresses and TCP ports) for a connection for a period of time after it ceases to be used. The description of the protocol state machine for TCP identifies a state called "TIME WAIT" which is entered when a connection has been closed by the peer TCPs. If each TCP keeps connections in TIME WAIT state for at least the maximum segment lifetime, there should be no remaining old segments from that connection anywhere in the internetwork when the TIME WAIT state is finally exited.
The implementation of TIME WAIT state requires maintaining a subset of the connection state information so that if any old segments do arrive, they can be identified as such and discarded. An inconvenience associated with TIME WAIT state is the need to store this information for what in some contexts may be considered a long period of time (two minutes). If TCP is to be used in some high-speed environment with many very brief connections being formed, used, and closed, there may be a large number of connections in TIME WAIT state. This could be a performance bottleneck in two respects: first, it results in more connection blocks to compare incoming segments with; second, these connection blocks may collectively consume a substantial amount of memory space. The seriousness of these concerns is very much dependent on the context in which a particular implementation of TCP is being used in practice.

Some implementors do not wish to tie up the port numbers for the length of time associated with TIME WAIT. Some operating systems place restrictions on the range of port numbers available to certain users; a problem could arise if the limit allows a user fewer ports than they plan to use in a time interval equal to MSL. (Although each end of the connection can assign its port numbers separately, there is an effective limit to the utility of this, since the connection is typically being used to connect to a predefined port number at the destination which is assigned to the particular service or application being used, e.g., TELNET or DOMAIN.) Speed limitations of networks have largely protected users from this class of problem in the past, but increased network performance and new classes of applications may make it an issue.

In order to partially circumvent the limitation, some implementors have suggested that the sequence number information of a closing connection be used to interpret any segments arriving for a connection in TIME WAIT state. If the incoming sequence number is in the half of the number space "above" the last sequence number of the TIME WAIT connection, it would be taken as referring to a different connection. This relies on the amount of time needed for the 32-bit sequence numbers to wrap around as a substitute for a strict TIME WAIT. At present, we probably do not have any networks fast enough to cause this assumption to break down; at 100 megabits per second (FDDI), the sequence space takes between 5 and 6 minutes to wrap around fully, so allowing half of the sequence space to be used as the safety factor leaves a remaining padding of more than the MSL. If TCP connections can be made to operate at even faster rates, this method may break down.

Another way to reduce the overhead cost of TIME WAIT connections is to reduce the amount of space used to retain their state. For a strict implementation of TIME WAIT, it is only necessary to retain the address and port information. This could afford some reduction in memory consumption for implementations where this is a concern.

PUSH BIT

The use of the "push" function in TCP may be somewhat obscure to the novice. The use and non-use of PUSH in various situations has an impact on the properties of the communication path as perceived by the end user. However, a TCP implementation can and should be designed to cause PUSH to have only specific, intentional impacts. The following discussion reviews the definition of PUSH and the implementation characteristics which are desirable to obtain the best performance.
First of all, it should be pointed out that the PUSH bit is an end-to-end control function operating on both the sending and receiving TCPs. When the PUSH bit is not set, a TCP is permitted to buffer data for an indefinite period of time for efficiency reasons. When the PUSH bit is set, indefinite buffering is not permitted, and the TCP is required to export currently buffered data to the appropriate interface at the first opportunity permitted.

SENDING PROCESS'S VIEW OF PUSH

It is essential, as a matter of correctness, for the PUSH function to be used at every point in communication where one party intends to cease transmission and wait for the other party to act. Failure to use PUSH in these cases creates a substantial likelihood of communications deadlock due to buffered data remaining undelivered.

It is desirable to not use the PUSH function at any other time. The overall performance of communication using TCP will be optimized by avoiding the PUSH function in those cases where buffering decisions may be safely left to the discretion of the TCP implementations at both ends of the connection.

A TCP is not required to provide a selectable PUSH function to the user process which performs the SEND function. When it is not provided, every SEND operation implicitly selects the PUSH function. This rule is needed to avoid the deadlock scenario previously mentioned. A user process operating with such a TCP should be designed with appropriate consideration for the impact of this implicit PUSH. Even if the sending TCP implements a segment-producing algorithm which effectively combines the SENDs, it may result in repeated data copying or other inefficiencies which impact the ultimate performance. Also, the production of many small packets may lead to undesirable overhead in I/O processing on either the sending or the receiving hardware. (If the hardware and operating system give unconditional priority to I/O processing, the burden is borne by user processes. The TCP congestion control mechanisms do not detect this type of burden until it becomes so large that the user processing backlog propagates back to the I/O environment, at which point the system may be completely useless from the viewpoint of an end user.) In such situations, it may be desirable to implement some buffering in the user process so that less overhead is consumed by SENDs.

SENDING TCP'S HANDLING OF PUSH

In the case of a sending TCP, the use of the PUSH function implies that the outgoing data on hand should be made available to the segment-producing code, and any resulting segments should be transmitted via the IP layer interface at the first opportunity permitted by the constraints of available window space, the Nagle algorithm, and the congestion control algorithm. (Note that the PUSH function conceptually overrides the sender portion of silly window avoidance.) If the sending TCP does not allow the user process to specify whether to PUSH, it must cause the segment production mechanism to operate each time the user process executes the SEND function, and set the PUSH bit on the last segment produced as a result.

If a sending TCP produces segments asynchronously with respect to the transmission via the IP layer, a situation will sometimes exist in which there are multiple segments queued for transmission via IP. In this case, it is permitted and desirable for the sending TCP to combine segments before transmission in certain situations (such as when retransmission is occurring).
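For illustration, combining two adjacent queued segments might look like the C sketch below. The structure and function names are invented for this example, and the PUSH handling shown anticipates the rule stated in the next paragraph.

    #include <string.h>

    /* Hypothetical representation of a segment queued for transmission. */
    struct queued_seg {
        unsigned long  seq;          /* starting sequence number          */
        unsigned int   len;          /* octets of data                    */
        int            psh;          /* PUSH requested for this segment   */
        unsigned char *data;
    };

    /* Combine b into a, assuming b immediately follows a in sequence
     * space and the result does not exceed the effective send MSS
     * (checked by the caller in this sketch).
     */
    static void combine_segments(struct queued_seg *a, struct queued_seg *b)
    {
        memcpy(a->data + a->len, b->data, b->len);
        a->len += b->len;
        /* The combined segment contains the last octet of both original
         * segments, so it should carry PUSH if either of them did.      */
        a->psh = a->psh || b->psh;
    }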
The sending TCP is NOT required to send each PUSHed segment separately. When segments are combined, any resulting segment should have its PUSH bit set if it contains the last octet of any former PUSHed segment. If a sending TCP produces segments synchronously with respect to the transmission via the IP layer (as with the Berkeley networking code), it is adequate to remember the first and last PUSHed sequence numbers, and set the PUSH bit on all created segments containing any portion of that range of sequence numbers.

RECEIVING TCP'S HANDLING OF PUSH

When a receiving TCP receives a segment with the PUSH bit set, it should transfer any data on hand to the receiving process as soon as it is possible to do so. This does not require that each PUSHed segment be delivered in response to one distinct RECV call by the user process.

The best implementation of PUSH handling in the receiving TCP will depend somewhat on the buffering strategy used between the TCP and the user process. It should also take into account (to the extent practical) the status of the user process with respect to the connection; in other words, if the user process is blocked because it has no data to process, the TCP should attempt to give it some data fairly quickly and enable the user process to run.

If multiple data segments arrive at effectively the same time, there is some question of how best to deal with a PUSH in an earlier segment. It is desirable to start the user process computing quickly, so that the data will begin to be put to use and application throughput will arguably be maximized. However, it may be more efficient in certain systems to transfer data from as many incoming segments as possible into the hands of the user process before causing it to run. The designer should consider the overall architecture and performance characteristics of the computing environment and choose an approach that minimizes the amount of time that is dissipated unproductively.

RECEIVING PROCESS'S VIEW OF PUSH

The availability of a PUSH indication to the receiving process is an optional feature of TCP implementations. If available, it may allow some performance optimizations in the application. If an asynchronous "interrupt" or "signal" capability exists, this may be a desirable means of indicating the availability of PUSHed data, particularly if some type of shared memory buffer is used to pass data from the TCP to the user process. If the interface is essentially synchronous, the PUSH indication will likely be a flag or return value of the RECV function.

The PUSH indication is of limited practical utility; perhaps the most likely useful situation is that of a user process which itself performs communications functions. Such a process might use the PUSH indication as an indication to select the corresponding functionality when it outputs the data.

RTT ESTIMATION AND RTO CALCULATION

In normal operation, TCP implementations use estimates of Round Trip Time (RTT: the elapsed time between the transmission of segments and the receipt of their acknowledgements) as the basis for calculating the value for Retransmission Timeout (RTO: the maximum time a TCP implementation will await acknowledgement without retransmitting a segment). The RTO value is calculated dynamically from RTT estimates so that TCP can operate over a wide range of network hardware and under varying load conditions. For acceptable TCP performance, the RTO must be neither too low nor too high.
If a TCP implementation consistently uses too large an RTO, then it will be slow in detecting dropped segments. Dropped packets will cause the TCP peers to become inactive, as the sending TCP waits for the RTO to expire and the receiving TCP waits for a sequence number it can acknowledge. On the other hand, too low an RTO value will result in spurious retransmissions, since the sending TCP will misinterpret delays as loss. This is clearly wasteful, and can cause congestion.

The TCP specification includes an algorithm, "given...as illustration," that calculates RTO based on an RTT estimate. Though the algorithm is widely used in TCP implementations, it has known flaws and has proven to be inadequate. The two major problems with this RTO algorithm are:

1. It does not address the issue of sampling RTTs in the event of segment retransmission; and

2. It cannot cope with the high variation in RTT that typifies highly utilized systems.

Phil Karn has developed an algorithm that solves the first of the above problems (Karn & Partridge, 87), and Van Jacobson and Mike Karels have developed an algorithm that solves the second (Jacobson, 88). Experience has shown that using Jacobson's algorithm to compute RTO based on RTT samples, and Karn's algorithm to select RTT samples, can dramatically improve TCP performance. As a result, the Host Requirements RFC mandates the use of these algorithms (Section 4.2.3.1).

The RTT sampling problem that arises when segments are retransmitted is that it is impossible for the sending TCP to determine which transmission is being acknowledged. In other words, there is no way to tell whether the RTT should be measured from the initial segment transmission, or from the retransmission. Use of either approach can seriously corrupt the RTT estimation. On the one hand, if RTT is always measured from the most recent retransmission, then the implementation will severely underestimate RTT exactly when increases in delay cause unnecessary retransmissions. It is even possible for an implementation using this approach to enter a stable state in which it continually underestimates RTT, and as a result transmits every segment several times. On the other hand, if the more conservative approach is used, then RTT will be overestimated for networks that drop packets. Because exponential RTO increases are mandated for successive retransmissions (see section XX, "RTO Backoff"), the effect of RTT overestimation can accumulate. In sufficiently lossy networks (e.g., if about one in three packets is lost, and the losses are independent of each other), RTO will tend to increase to and remain at its upper bound.

The Karn algorithm solves the problem of acknowledgement ambiguity by discarding all RTT measurements that are tainted by a retransmission. The algorithm also stipulates that RTO will not be calculated from RTT estimates until legitimate RTT samples are available. If retransmissions occur, then "backed off" RTO values are retained for use by the next segment (see section XX, "RTO Backoff"). Clamping the RTO values in this fashion prevents TCP from injecting excess traffic as a result of a poor RTT estimate.

Jacobson's algorithm addresses problems in RTO calculation. Originally, RTO was calculated as:

   RTO = beta * SRTT

where beta is a constant between 1.3 and 2, and SRTT is the "Smoothed Round Trip Time." SRTT is obtained by the formula:

   SRTT(n+1) = alpha*SRTT(n) + (1 - alpha)*RTT(n)

where alpha is a constant between 0.8 and 0.9, and RTT(n) is the nth RTT measurement.
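As a point of reference, this classic calculation can be written in a few lines of C. The fragment below is only an illustrative sketch: the constants are example choices from the ranges quoted above, the variable and function names are invented, and floating point is used for readability even though real implementations typically use scaled integer arithmetic.

    /* Sketch of the classic (RFC 793 style) RTO calculation described
     * above, with example constants alpha = 0.875 and beta = 2.
     */
    static double srtt = 0.0;           /* smoothed round trip time, seconds */

    static double classic_rto(double rtt_sample)
    {
        const double alpha = 0.875;     /* within the 0.8 - 0.9 range */
        const double beta  = 2.0;       /* within the 1.3 - 2 range   */

        srtt = alpha * srtt + (1.0 - alpha) * rtt_sample;
        return beta * srtt;
    }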
This approach for calculating an average is known as "exponential smoothing." It is a weighted average: the most recent observations are given the most weight, while the impact of earlier observations decreases as the algorithm is iterated.

The problem with the original RTO calculation is that a fixed multiplier is used to cope with the variance of round trip delay. This approach is inadequate for the Internet; it is well known that RTT varies tremendously. This is not surprising: from queuing theory, we expect that systems under heavy load will display large variance in delay. Increasing a fixed beta sufficiently to accommodate the range of delay variance would result in an RTO that is much too inflated.

In the Jacobson algorithm, RTO is calculated as:

   RTO = SRTT + 2 * MDEV

where MDEV is the average absolute error -- the "mean deviation" -- of SRTT. Jacobson chose mean deviation rather than standard deviation because it is much simpler to calculate. As he points out, however, the mean deviation is itself a rather good estimator for standard deviation. A rough summary of the Jacobson algorithm is that delays as great as roughly two standard deviations above average will be accepted without retransmission. The most important feature of this algorithm is that RTO is dynamically tailored to the sampled RTT variance.

In Jacobson's algorithm, the estimate for MDEV is derived from second-order exponential smoothing, as follows:

   Err = | RTT - SRTT |
   SErr(n+1) = gamma*SErr(n) + (1 - gamma)*Err

where SErr(n) is the nth "smoothed error." Because it is desirable to have RTO adapt quickly when network load and delay increase, the value for gamma should probably be lower than the value used for alpha. Jacobson suggests 0.875.

In TCP implementations, SRTT and SErr are recalculated often. Hence, the calculation should be efficient. In an appendix of "Congestion Avoidance and Control," Jacobson presents an integer arithmetic implementation of his algorithm that requires only six adds, three shifts, and a compare to obtain values for SRTT, SErr, and RTO.

The procedures for calculating SRTT and SErr are recursive, as each new SRTT and SErr is a function of the previously calculated SRTTs or SErrs, respectively. For the first calculation, however, there are no previously calculated values to use. Hence, these algorithms need seeds for the first calculation. The Host Requirements RFC encourages an initial value of 0 for RTT and 3 seconds for RTO. It also recommends that the Jacobson algorithm begin operation with the SYN segments exchanged during the 3-way handshake of a TCP open (Section 4.2.3.1).

RETRANSMISSION BACKOFF

The interval between successive retransmissions of a TCP segment must increase, or "back off," exponentially. An appropriate rate of increase is for the retransmission timeout (RTO) to double for each transmission of the same segment. If an internet's TCP implementations do not employ exponential backoff, then that internet will be vulnerable to congestive collapse. For this reason, the Host Requirements RFC lists exponential backoff as a TCP requirement (Section 4.2.3.1).

TCP retransmissions occur either because datagrams are lost, or because the actual round trip time of a segment's transmission and its acknowledgement exceeds the RTO. In either event, the root cause is, in most cases, congestion. Hence, TCP retransmissions often have the effect of adding load to a system that is already past saturation. Work is in progress on a proof that an exponential RTO increase is a requirement for stable, collapse-free networks.
An informal argument to this effect is presented in "Congestion Avoidance and Control" (SIGCOMM '88), by Van Jacobson.

Karn's algorithm for selecting round trip time (RTT) samples is also required by the Host Requirements RFC (Section 4.2.3.1). In general, the RTO is computed as a function of estimated RTT. In the Karn algorithm, however, "backed off" values of RTO are retained until a segment is acknowledged without being retransmitted. The motivation for this algorithm is similar to the motivation for exponential backoff: because persisting traffic tie-ups can result from adding traffic to a congested system, TCP implementations should use conservative RTO values until they can obtain sample RTTs that are uncorrupted by retransmissions. The Karn algorithm is readily integrated with the RTT estimation procedures described in Section XX.

OUT OF ORDER SEGMENT HANDLING

The TCP specification (RFC 793) indicates that TCP implementations may choose whether or not to support the buffering of out-of-order segments. In this context, the term "out of order" refers specifically to a segment arriving at a receiving TCP which does not contain the data octet or control function corresponding to the next sequence number expected by the receiving TCP, but is otherwise an acceptable segment. In the terminology of RFC 793, such a segment has SEG.SEQ > RCV.NXT.

It is recommended that a TCP implementation endeavor to buffer out of order segments to the extent possible. RFC 793 notes that not buffering these segments has an unfavorable impact on performance; the remainder of this section explains the extent of the impact.

It seems obvious that a sending TCP will generally transmit segments in sequence number order. Therefore, when an out of order segment arrives at a destination TCP, some type of interesting event has occurred somewhere in the internetwork components between the sender and receiver. There seem to be two primary classes of such events. First, the loss of a segment in the network will cause any subsequent segments (those with higher SEG.SEQ) to appear to be "out of order". In the second class of event, characteristics of the intervening network lead to the reordering of segments. These cases will be considered separately.

LOST SEGMENT CASE

Segments may be effectively lost between the sender and receiver in several ways. They may be deliberately dropped by gateways due to congestion, discarded by gateways or hosts due to data corruption resulting in IP or TCP checksum errors, misdelivered due to anomalies in routing, or they may just vanish due to some type of hardware or software failure.

When a segment is lost for any of these reasons, subsequent segments in transit may continue to be delivered successfully. In TCP, the recovery of the lost segment is accomplished by timer-driven retransmission from the sending TCP. Buffering the out of order segments makes it unnecessary for subsequent segments to be retransmitted if they arrived at the destination TCP intact. This strategy fits together with the requirement for the TCP retransmission algorithm to cause only one segment to be retransmitted on expiration of the timer. When the timer associated with the lost segment expires, that segment (or one containing a superset of its contents) will be retransmitted and the retransmission timer will be restarted with a longer duration.
If this retransmitted segment arrives successfully at the destination, the destination TCP will be able to process not only that segment, but also the buffered segments which were received "out of order". The acknowledgement then sent will cover all of these segments, and the sending TCP will not retransmit the other data which was sent previously.

On the other hand, if the "out of order" segments were not buffered at the destination, the data contained in them will need to be retransmitted as a result of the one preceding segment being lost. This retransmission may require several packet transit times, since the slow-start algorithm will be employed to reestablish the rate of flow. (The use of slow-start is necessary here because the sending TCP cannot distinguish this set of events from a congestion-induced packet loss. In fact, the original lost packet may have been lost due to congestion, or due to other types of channel errors for which slow-start is not logically necessary.)

[WWB: Not clear to me here whether there is something clever about VJ code to take advantage of an ACK for a huge amount of data to make the congestion window stuff not slow-start, or if this is a possible enhancement, or if it is not doable. Requires thought.]

NETWORK REORDERING CASE

Reordering of segments within a network or internetwork is a property which may seem to be inherently incompatible with sensible implementations of network interfaces. However, when multiple physical or logical paths are used to carry the segments of a single TCP connection, the possibility of reordering exists. This situation might occur as a transient event due to path changes as a result of load balancing. It can also occur on a continuing basis when multiple physical or logical links are used to carry the segments of a single TCP connection. For example, several implementations of IP over X.25 networks will open multiple virtual circuits in order to circumvent the limited window size supported by a particular network's X.25 implementation. Or, there may be several relatively low-speed links which are equally good paths to the destination, and they may be used in a round-robin fashion to attempt to gain better throughput.

In any of these situations, the possibility exists of reordering (as viewed by the destination) due to varying transit times on different links. The transit time can vary due to internal fluctuations in loading, or it may differ simply because consecutive segments vary in size. For example, if a sending TCP transmits a large segment followed by a small segment and each of these segments travels through a different low-speed link, the small segment may well "arrive" first.

IMPLEMENTATION CONSIDERATIONS

A distinction should be understood between out of order segments which are (at least partially) within the window, and any other out of order segments. The latter are, strictly speaking, "not acceptable" segments according to RFC 793. However, some implementations may wish to retain some such segments anyway. For example, if a receiving TCP ever "shrinks its window" by decreasing the quantity RCV.NXT+RCV.WND, there may be segments received which could be construed as potentially acceptable.

Buffering out of order segments clearly consumes additional memory resources in a TCP implementation. For many implementations, the space required is not large relative to the overall space available. However, there will probably be some cases where buffer space is a scarce resource.
Several implementation strategies may be useful in such cases to try to achieve the benefits of out of order segment buffering without leading to unacceptable resource impacts. It is plausible that a TCP may purge out of order segments according to some pragmatic rule, such as only allowing some quantity X of out of order data to be buffered. In such a case, the segments buffered should be those closer to the left edge of the window. When incoming segments tend to be small, as in the case of certain TELNET sessions, economies in space utilization may be achieved by combining adjacent out-of-order segments into larger segments or buffers. There is a tradeoff between memory and processor utilization in this situation. It is plausible that a TCP might have a space recovery routine which is called when memory resources are found to be running low. This routine might discard out-of-order data as part of its space recovery strategy.

It is probably desirable to transmit an acknowledgement for out of order segments using the same rules as for other segments, whether the out of order segment is buffered or not. Note that this does not mean that the acknowledgement sequence number is increased; it will typically repeat the acknowledgement sequence number of a previous segment. Although a redundant acknowledgement may seem useless, it actually assists in maintaining throughput under some conditions. In particular, if the sending TCP implements Karn's "fast retransmit" algorithm, the redundant acknowledgement would be recognized as indicating a missing segment, and would induce an immediate retransmission of the next segment.

DELAYED ACKNOWLEDGEMENTS

The TCP protocol provides an acknowledgement sequence number field in every segment. When the ACK flag in the TCP header is set to 1, the value in this field (known as SEG.ACK in RFC 793) is significant, and indicates that all data octets or control flags corresponding to incoming sequence numbers through the number specified have been received by the acknowledging TCP. This definition leaves open the question of when acknowledgements are sent. Since the field is present in every packet, it is obviously useful to set the appropriate value in any outgoing data-carrying packet which happens to be created. However, if the flow of data on a TCP connection is irregular or completely unidirectional for a substantial period of time, it is an implementation choice as to when an acknowledgement-only segment is generated.

A wide range of acknowledgement strategies can be imagined. At one extreme, an implementation might transmit a separate acknowledgement for every segment it receives. The opposite extreme would be to send an acknowledgement only when the window has been filled. (Note that delaying acknowledgement any longer than this would create a deadlock.) Any strategy within this range has certain potential benefits and drawbacks from a performance perspective.

Most obviously, the transmission of an ACK-only segment uses a relatively large amount of network resources in relation to the information conveyed. Other drawbacks of immediate acknowledgement for each segment include the possibility that the local user's processing of the incoming data may quickly create a need to generate another outgoing segment to update the offered window or to send some user data in response. In either of these cases, it would be better to combine the acknowledgement, window update and response data into a single segment.
A notable example of this type of connection activity is the support of TELNET for character-asynchronous remote login sessions with remote echoing.

The inherent problem with delayed acknowledgement is that it is not generally possible for the TCP to know for certain what is likely to occur (or how long it will take to occur) when incoming data is turned over to a process. The inevitable compromise is to delay sending the acknowledgement for some period of time, in the hope that some outgoing segment will be created for one reason or another. If one is created, then the acknowledgement is "piggybacked" on that segment. If no such segment is created within some interval, an ACK-only segment is created and transmitted.

A strategy of deferring acknowledgements has a less obvious but potentially significant performance impact: it destabilizes the other TCP's estimated round-trip time for the connection, and thus may cause inappropriate retransmissions or delay appropriate ones. In either case, connection throughput will be reduced. Since excessive retransmissions also contribute to global congestion, it is particularly important to avoid this condition.

It seems intuitively obvious that some intermediate strategy should provide optimal results. It is unclear whether any one strategy is close to optimal for a wide range of operating environments. Factors influencing the optimal strategy include the window size potentially available, the residual network layer error rate experienced by the connection, the minimum RTT, the mean RTT, the variability of the RTT, timing considerations in the local TCP, and the retransmission-related parameters and algorithms of the other TCP. If the segment transit times in both directions are (on average) the same, half of the RTT will have been used up at the first possible moment for sending an ACK. Since the other half of the RTT is presumably required for transit of the ACK, the safe delay period is proportional to the variability-based component in the other TCP's RTO calculation.

The current recommendation is to acknowledge at least every other segment received, and not to delay an acknowledgement by more than 0.5 seconds. This is a pragmatic engineering compromise, not a demonstrably optimal solution. It permits up to 50% of ACK-only segments to be eliminated, thus presumably allowing a suitable implementation to achieve much of the benefit obtainable from a reduction in overhead transmissions.

Acknowledging at least every second segment (or any similar constraint) accomplishes three things:

(1) It adapts the delay interval to the speed of the connection for fast connections (which is to say, for connections which pass more than four segments per second in one direction).

(2) It provides for transmission of redundant ACKs in the case of a dropped segment, allowing the other TCP to effectively use the fast retransmit algorithm if implemented.

(3) It reduces the likelihood of an unnecessary suspension of transmission by the other TCP when it applies the Nagle algorithm.

The 0.5 second limit (or another absolute limit) accomplishes two things:

(1) It clamps the delay for slow connections to avoid creating another source of delay and thus reducing throughput more than necessary.

(2) It sets a bound on the perturbation in RTT variance observed by the other TCP (so that RTO for high-delay connections doesn't grow excessively and impede any legitimate retransmissions).
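The recommendation just described requires very little machinery. The following C sketch is only illustrative: the connection structure, the timer interface, and the helper functions are invented for this example. It sends an acknowledgement immediately for every second data segment received, and otherwise arms a timer so that no acknowledgement is delayed by more than 0.5 seconds.

    /* Illustrative delayed-ACK policy: ACK at least every second segment,
     * and never delay an ACK by more than 0.5 seconds.  All names here
     * are hypothetical.
     */
    #define ACK_DELAY_MS 500

    struct conn {
        int unacked_segments;    /* data segments received but not yet ACKed */
        int ack_timer_armed;
    };

    /* Assumed to exist elsewhere: transmit an ACK (ACK-only or piggybacked
     * on pending data) and arm/cancel the delayed-ACK timer.               */
    extern void send_ack(struct conn *c);
    extern void arm_ack_timer(struct conn *c, int ms);
    extern void cancel_ack_timer(struct conn *c);

    /* Called whenever an in-order data segment has been accepted. */
    void data_segment_received(struct conn *c)
    {
        if (++c->unacked_segments >= 2) {      /* every other segment: ACK now */
            send_ack(c);
            c->unacked_segments = 0;
            if (c->ack_timer_armed) {
                cancel_ack_timer(c);
                c->ack_timer_armed = 0;
            }
        } else if (!c->ack_timer_armed) {      /* otherwise bound the delay */
            arm_ack_timer(c, ACK_DELAY_MS);
            c->ack_timer_armed = 1;
        }
    }

    /* Called if the 0.5-second delayed-ACK timer expires first. */
    void ack_timer_expired(struct conn *c)
    {
        send_ack(c);
        c->unacked_segments = 0;
        c->ack_timer_armed = 0;
    }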