OUTLINE (as discussed at IETF 4/89)
x's for text included below

Connection State, Establishment & Closing
     Connection Establishment  Y
   x Maximum Segment Size  M
     TOS  N
     Precedence  M
   x Management of TCBs  Y
   x Connection Reuse  Y

Send Side
     Slow-start Congestion Avoidance  Y
     Sender Silly Window Avoidance  Y
     Nagle Small Packet Avoidance  Y
     Relationship of Nagle & S-S  Y
   x PUSH  Y
   x RTT Estimation & RTO Calculation  Y
     Rxt Amount  Y
   x Rxt Backoff  Y
     Response to Source Quench  Y
     Zero-window & SWS Cant-Send State  M

Receive Side
   x Out-of-order Processing  Y
     Window Upper Limits  M
     Receiver Silly Window Avoidance  Y
   x Delayed ACK  Y
     Piggy-back ACKs  M

Interfaces and Other
     Application-TCP Interface  Y
     Fairness Among Connections  Y
     TCP-IP Interface  M
     Fair Processor  Y
     Off-board Issue  M
     Connection Instrumentation  Y
     Extended Window Option  Y
     High-Speed Implementation Techniques  M

TEXT (as of 5/10/89)

MAXIMUM SEGMENT SIZE

During a data transfer, the length of TCP segments has a decided effect on performance. The TCP connection parameter that determines segment length is "MSS," for "Maximum Segment Size." Poor performance can result from an MSS that is either too large or too small.

A segment is a unit of data that is transferred between TCP modules. In other words, it is the data portion of a TCP PDU. It is segments that are acknowledged or retransmitted by TCP modules.

The motivation for using large segments is to minimize the relative amount of header overhead and packet processing during a data transfer. Up to a point, the larger the segments, the fewer IP datagrams a data transfer will require. Clearly, transmission bandwidth and memory are saved if fewer headers are required. Furthermore, the processing needed for a data transfer is roughly proportional to the number of packets, regardless of packet length.

The efficiency gains from large MSS values end, however, as soon as segments are so large that IP must fragment them. If an IP datagram exceeds the MTU length of a network it must traverse -- if it is too long for a network interface -- then the IP module will fragment the datagram. Each fragment contains an IP header, along with an offset and a flag to indicate whether a particular fragment is the last. Further along, if they must traverse networks with even smaller MTUs, the fragments themselves may undergo fragmentation. Fragments are reassembled by the IP module at their destination. Hosts running IP must have the ability to accept, or reassemble from fragments, datagrams of at least 576 bytes [Host Requirements RFC, Section 3.3.3].

Because fragmentation results in the generation of multiple IP headers and causes additional processing, it immediately wipes out the advantages of using a large MSS. The most detrimental effect on performance, however, occurs at the destination. Fragments are not delivered piecemeal: IP must successfully reassemble an entire datagram before delivering it to an upper-layer protocol. If a single fragment is lost, then the remaining datagram fragments will eventually be dropped, and all resources that were used for their transmission are wasted. The ill effects of IP fragmentation are compounded by the fact that some gateway implementations simply cannot perform correct datagram fragmentation. For a comprehensive analysis of the effect of IP fragmentation on performance, see the article by Kent and Mogul, "Fragmentation Considered Harmful" [SIGCOMM citation].
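The tradeoff described above can be made concrete with a small back-of-the-envelope calculation. The following C fragment is only an illustrative sketch: the constants and function names are invented for this example, fragment sizes are approximated without the 8-octet offset alignment rule, and 20-byte IP and TCP headers with no options are assumed. It estimates how many packets and header octets a transfer costs for several MSS choices over a path with a 1500-octet (Ethernet) MTU.

    /* Illustrative sketch only: estimate how many packets and header
     * octets a transfer of "nbytes" costs for a given MSS and path MTU,
     * assuming 20-byte IP and TCP headers and no options.  Names and
     * structure are invented for this example.
     */
    #include <stdio.h>

    #define IP_HDR  20
    #define TCP_HDR 20

    static long ceil_div(long a, long b) { return (a + b - 1) / b; }

    static void transfer_cost(long nbytes, long mss, long path_mtu)
    {
        long segs, dgram, frags_per_dgram, packets, hdr_bytes;

        segs = ceil_div(nbytes, mss);            /* TCP segments sent       */
        dgram = mss + TCP_HDR + IP_HDR;          /* resulting IP datagram   */
        frags_per_dgram = 1;
        if (dgram > path_mtu)                    /* IP must fragment        */
            frags_per_dgram = ceil_div(dgram - IP_HDR, path_mtu - IP_HDR);

        packets = segs * frags_per_dgram;
        hdr_bytes = segs * TCP_HDR + packets * IP_HDR;

        printf("MSS %5ld: %5ld segments, %5ld packets, %6ld header octets\n",
               mss, segs, packets, hdr_bytes);
    }

    int main(void)
    {
        long n = 100000;                /* a 100,000-octet transfer        */
        transfer_cost(n, 512,  1500);   /* conservative MSS                */
        transfer_cost(n, 1460, 1500);   /* fills the Ethernet MTU exactly  */
        transfer_cost(n, 4096, 1500);   /* too large: every datagram is    */
                                        /* fragmented en route             */
        return 0;
    }

The overly large MSS in the last case produces fewer TCP segments but more packets and more header octets than the MTU-sized MSS, which is the effect the text above describes.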
Given the above considerations, the optimum MSS for a connection will be the maximum MSS that does not cause fragmentation.* This value is a function of the source, destination, and type of service used by the connection. As a result of internet connectivity changes, the optimum MSS value may change over time.

The Host Requirements RFC requires that hosts implement the MSS option, and recommends that an implementation use it during connection establishment if its receive MSS differs from 536 (Section 4.2.2.6). The value of the MSS used during data transfers is then computed from the formula:

   Eff.snd.MSS = min(SendMSS + 20, mts) - TCPhdrsize - IPoptionsize

where "SendMSS" is the MSS received from the remote host during connection establishment (assumed to be 536 if the remote peer does not use the MSS option), and "mts" is the maximum length of a transport message that the local IP will send. The value for mts is obtained from the local IP service interface.

An earlier approach to setting MSS was to use large values only if there was special knowledge that no fragmentation would result (e.g., if the destination were on the same Ethernet), but otherwise to use an MSS of 536. The algorithm given in the Host Requirements RFC is based on the tenet that MSS discovery is best handled by IP.

____________________
*It is a common practice to round the MSS down to a convenient block size. For example, an MSS of 536 is frequently approximated by 512.

CONNECTION BLOCK MANAGEMENT

The performance of a TCP implementation may be affected by the methods used to manage the connection state blocks, also known as Transmission Control Blocks or TCBs, associated with each TCP connection. This will be especially true on systems which have many connections in progress simultaneously.

When a TCP segment arrives, it is necessary to associate it with some TCB in order to decide what to do next. A TCP connection is uniquely identified by the 4-tuple {local address, local port, foreign address, foreign port} as described in RFC 793, section 2.7. For active connections, there will be exactly one TCB with matching values. Thus, one part of the processing of every incoming segment is to extract the address and port information and attempt to locate the corresponding TCB. The techniques used to perform this matching operation should be carefully chosen for efficiency, particularly for the most common case. Special considerations may be required to deal with TCBs which exist in an inactive state. Passive OPENs are one such case. These issues will be discussed separately later.

The simplest way of organizing the TCBs is as an unordered linear array or linked list. For small-system implementations of TCP, this may be the best approach. For a TCP which is supporting numerous connections, the time required to search through an unordered list may have a noticeable impact on processing overhead, and perhaps even on attainable throughput. Also, this approach tends to make it unwieldy to support multiple, possibly overlapping partially-specified passive OPENs in a useful way.

An alternative approach is to maintain a hash table with pointers to multiple linked lists. This might be a good strategy for moderate numbers of connections. It does raise the question of what key value and hash function are desirable. One possibility is to use the hash function:

   H = LP mod N

where LP is the local TCP port number, N is the size of the hash table, and H is the derived index into the hash table.
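As a concrete illustration of this scheme, the following C sketch looks up a TCB by hashing on the local port and then walking the chain for a fully-specified match. It is only a sketch under the assumptions of this section; the structure, field, and function names are invented here and do not correspond to any particular implementation.

    /* Minimal sketch of a hashed TCB lookup using H = LP mod N.
     * All names are invented for illustration.
     */
    #include <stddef.h>

    #define TCB_HASH_SIZE 8                  /* N */

    struct tcb {
        unsigned long  local_addr, foreign_addr;
        unsigned short local_port, foreign_port;
        struct tcb    *next;                 /* chain of TCBs with the same hash */
        /* ... remaining connection state ... */
    };

    static struct tcb *tcb_hash[TCB_HASH_SIZE];

    static struct tcb *
    tcb_lookup(unsigned long laddr, unsigned short lport,
               unsigned long faddr, unsigned short fport)
    {
        struct tcb *t;

        for (t = tcb_hash[lport % TCB_HASH_SIZE]; t != NULL; t = t->next) {
            if (t->local_port == lport && t->local_addr == laddr &&
                t->foreign_port == fport && t->foreign_addr == faddr)
                return t;                    /* fully-specified (active) match */
        }
        return NULL;                         /* caller may then consider any
                                                partially-specified passive OPENs */
    }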
N could be chosen as a power of 2 to allow the hash function to be implemented with a bitwise masking instruction. (Note that 8 is the smallest power of 2 which will provide a unique hash for the assigned well known ports for TELNET, FTP, SMTP, and DOMAIN services. This is by chance, not by design.)

It would seem preferable to use something more than the local port number as input to the hash function, especially if many connections are expected to be made to the same local port (e.g., a TELNET terminal server). However, the following facts should be considered:

(1) Hashing on the local address seems relatively unprofitable, since the majority of systems are not multiply homed and thus have only one local address.

(2) Hashing on the local address, foreign address, or foreign port may lead to complications in handling TCBs created by partially-specified passive OPENs. Since they have not yet been bound to a foreign address and port, several TCBs may potentially match an incoming SYN segment. An implementation should be structured to find the closest match (the one with the fewest unspecified fields). If all partially-specified TCBs for a given local port number are on a single linked list, this structuring can be achieved by inserting new TCBs at the correct position in the list.

Another way to speed up this matching operation is to arrange the list(s) of TCBs to cause the most frequently referenced TCBs to be checked first. Again, many variations on this strategy are possible. The approach taken in the Berkeley networking code is to install each new active connection at the head of the list. Since the Berkeley code uses the template/clone approach to implementing passive OPENs, the template passive OPENs for common servers are typically created shortly after the system is booted, and thus their TCBs are at the end of the list. When one of these TCBs is activated in response to an incoming SYN, a clone TCB is created and placed at the head of the list.

TIMEWAIT AND CONNECTION REUSE

In order to protect against old segments being interpreted as part of the traffic stream of a new connection, the TCP protocol imposes certain rules and procedures for segment handling and interpretation. The key elements of this mechanism are (1) a bound on maximum segment lifetime, and (2) the "quiet time" concept. Associated with (1) is the requirement for appropriately initializing, decrementing, and testing the Time-to-live (TTL) field of the IP datagram header. When both the TCP and IP specifications are strictly followed, the TTL serves to bound the lifetime of a TCP segment traveling through the internet. RFC 793 specifies MSL, the maximum segment lifetime, as two minutes; this is noted as being a somewhat arbitrary choice.

The other half of this integrity guard consists of each TCP's ensuring that it does not reuse the connection identifying information (source and destination IP addresses and TCP ports) for a connection for a period of time after it ceases to be used. The description of the protocol state machine for TCP identifies a state called "TIME WAIT" which is entered when a connection has been closed by the peer TCPs. If each TCP keeps connections in TIME WAIT state for at least the maximum segment lifetime, there should be no remaining old segments from that connection anywhere in the internetwork when the TIME WAIT state is finally exited.
The implementation of TIME WAIT state requires maintaining a subset of the connection state information so that if any old segments do arrive, they can be identified as such and discarded. An inconvenience associated with TIME WAIT state is the need to store this information for what in some contexts may be considered a long period of time (two minutes). If TCP is to be used in some high-speed environment with many very brief connections being formed, used, and closed, there may be a large number of connections in TIME WAIT state. This could be a performance bottleneck in two respects: first, it results in more connection blocks to compare incoming segments with; second, these connection blocks may collectively consume a substantial amount of memory space. The seriousness of these concerns is very much dependent on the context in which a particular implementation of TCP is being used in practice.

Some implementors do not wish to tie up the port numbers for the length of time associated with TIME WAIT. Some operating systems place restrictions on the range of port numbers available to certain users; a problem could arise if the limit allows a user fewer ports than they plan to use in a time interval equal to MSL. (Although each end of the connection can assign its port numbers separately, there is an effective limit to the utility of this, since the connection is typically being used to connect to a predefined port number at the destination which is assigned to the particular service or application being used, e.g., TELNET or DOMAIN.) Speed limitations of networks have largely protected users from this class of problem in the past, but increased network performance and new classes of applications may make it an issue.

In order to partially circumvent the limitation, some implementors have suggested that the sequence number information of a closing connection be used to interpret any segments arriving for a connection in TIME WAIT state. If the incoming sequence number is in the half of the number space "above" the last sequence number of the TIME WAIT connection, it would be taken as referring to a different connection. This relies on the amount of time needed for the 32-bit sequence numbers to wrap around as a substitute for a strict TIME WAIT. At present, we probably do not have any networks fast enough to cause this assumption to break down; at 100 megabits per second (FDDI), the sequence space takes between 5 and 6 minutes to wrap around fully, so allowing half of the sequence space to be used as the safety factor leaves a remaining padding of more than the MSL. If TCP connections can be made to operate at even faster rates, this method may break down.

Another way to reduce the overhead cost of TIME WAIT connections is to reduce the amount of space used to retain their state. For a strict implementation of TIME WAIT, it is only necessary to retain the address and port information. This could afford some reduction in memory consumption for implementations where this is a concern.

PUSH BIT

The use of the "push" function in TCP may be somewhat obscure to the novice. The use and non-use of PUSH in various situations has an impact on the properties of the communication path as perceived by the end user. However, a TCP implementation can and should be designed to cause PUSH to have only specific, intentional impacts. The following discussion reviews the definition of PUSH and the implementation characteristics which are desirable to obtain the best performance.
First of all, it should be pointed out that the PUSH bit is an end-to-end control function operating on both the sending and receiving TCPs. When the PUSH bit is not set, a TCP is permitted to buffer data for an indefinite period of time for efficiency reasons. When the PUSH bit is set, indefinite buffering is not permitted, and the TCP is required to export currently buffered data to the appropriate interface at the first opportunity permitted.

SENDING PROCESS'S VIEW OF PUSH

It is essential, as a matter of correctness, for the PUSH function to be used at every point in communication where one party intends to cease transmission and wait for the other party to act. Failure to use PUSH in these cases creates a substantial likelihood of communications deadlock due to buffered data remaining undelivered.

It is desirable to not use the PUSH function at any other time. The overall performance of communication using TCP will be optimized by avoiding the PUSH function in those cases where buffering decisions may be safely left to the discretion of the TCP implementations at both ends of the connection.

A TCP is not required to provide a selectable PUSH function to the user process which performs the SEND function. When it is not provided, every SEND operation implicitly selects the PUSH function. This rule is needed to avoid the deadlock scenario previously mentioned. A user process operating with such a TCP should be designed with appropriate consideration for the impact of this implicit PUSH. Even if the sending TCP implements a segment-producing algorithm which effectively combines the SENDs, it may result in repeated data copying or other inefficiencies which impact the ultimate performance. Also, the production of many small packets may lead to undesirable overhead in I/O processing on either the sending or the receiving hardware. (If the hardware and operating system give unconditional priority to I/O processing, the burden is borne by user processes. The TCP congestion control mechanisms do not detect this type of burden until it becomes so large that the user processing backlog propagates back to the I/O environment, at which point the system may be completely useless from the viewpoint of an end user.) In such situations, it may be desirable to implement some buffering in the user process so that less overhead is consumed by SENDs.

SENDING TCP'S HANDLING OF PUSH

In the case of a sending TCP, the use of the PUSH function implies that the outgoing data on hand should be made available to the segment-producing code, and any resulting segments should be transmitted via the IP layer interface at the first opportunity permitted by the constraints of available window space, the Nagle algorithm, and the congestion control algorithm. (Note that the PUSH function conceptually overrides the sender portion of silly window avoidance.) If the sending TCP does not allow the user process to specify whether to PUSH, it must cause the segment production mechanism to operate each time the user process executes the SEND function, and set the PUSH bit on the last segment produced as a result.

If a sending TCP produces segments asynchronously with respect to the transmission via the IP layer, a situation will sometimes exist in which there are multiple segments queued for transmission via IP. In this case, it is permitted and desirable for the sending TCP to combine segments before transmission in certain situations (such as when retransmission is occurring).
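For illustration, combining two adjacent queued segments might look like the C sketch below. The structure and function names are invented for this example, and the PUSH handling shown anticipates the rule stated in the next paragraph.

    #include <string.h>

    /* Hypothetical representation of a segment queued for transmission. */
    struct queued_seg {
        unsigned long  seq;          /* starting sequence number          */
        unsigned int   len;          /* octets of data                    */
        int            psh;          /* PUSH requested for this segment   */
        unsigned char *data;
    };

    /* Combine b into a, assuming b immediately follows a in sequence
     * space and the result does not exceed the effective send MSS
     * (checked by the caller in this sketch).
     */
    static void combine_segments(struct queued_seg *a, struct queued_seg *b)
    {
        memcpy(a->data + a->len, b->data, b->len);
        a->len += b->len;
        /* The combined segment contains the last octet of both original
         * segments, so it should carry PUSH if either of them did.      */
        a->psh = a->psh || b->psh;
    }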
The sending TCP is NOT required to send each PUSHed segment separately. When segments are combined, any resulting segment should have its PUSH bit set if it contains the last octet of any former PUSHed segment. If a sending TCP produces segments synchronously with respect to the transmission via the IP layer (as with the Berkeley networking code), it is adequate to remember the first and last PUSHed sequence numbers, and set the PUSH bit on all created segments containing any portion of that range of sequence numbers.

RECEIVING TCP'S HANDLING OF PUSH

When a receiving TCP receives a segment with the PUSH bit set, it should transfer any data on hand to the receiving process as soon as it is possible to do so. This does not require that each PUSHed segment be delivered in response to one distinct RECV call by the user process.

The best implementation of PUSH handling in the receiving TCP will depend somewhat on the buffering strategy used between the TCP and the user process. It should also take into account (to the extent practical) the status of the user process with respect to the connection; in other words, if the user process is blocked because it has no data to process, the TCP should attempt to give it some data fairly quickly and enable the user process to run.

If multiple data segments arrive at effectively the same time, there is some question of how best to deal with a PUSH in an earlier segment. It is desirable to start the user process computing quickly, so that the data will begin to be put to use and application throughput will arguably be maximized. However, it may be more efficient in certain systems to transfer data from as many incoming segments as possible into the hands of the user process before causing it to run. The designer should consider the overall architecture and performance characteristics of the computing environment and choose an approach that minimizes the amount of time that is dissipated unproductively.

RECEIVING PROCESS'S VIEW OF PUSH

The availability of a PUSH indication to the receiving process is an optional feature of TCP implementations. If available, it may allow some performance optimizations in the application. If an asynchronous "interrupt" or "signal" capability exists, this may be a desirable means of indicating the availability of PUSHed data, particularly if some type of shared memory buffer is used to pass data from the TCP to the user process. If the interface is essentially synchronous, the PUSH indication will likely be a flag or return value of the RECV function.

The PUSH indication is of limited practical utility; perhaps the most likely useful situation is that of a user process which itself performs communications functions. Such a process might use the PUSH indication as an indication to select the corresponding functionality when it outputs the data.

RTT ESTIMATION AND RTO CALCULATION

In normal operation, TCP implementations use estimates of Round Trip Time (RTT: the elapsed time between the transmission of segments and the receipt of their acknowledgements) as the basis for calculating the value for Retransmission Timeout (RTO: the maximum time a TCP implementation will await acknowledgement without retransmitting a segment). The RTO value is calculated dynamically from RTT estimates so that TCP can operate over a wide range of network hardware and under varying load conditions. For acceptable TCP performance, the RTO must be neither too low nor too high.
If a TCP implementation consistently uses too large an RTO, then it will be slow in detecting dropped segments. Dropped packets will cause the TCP peers to become inactive, as the sending TCP waits for the RTO to expire and the receiving TCP waits for a sequence number it can acknowledge. On the other hand, too low an RTO value will result in spurious retransmissions, since the sending TCP will misinterpret delays as loss. This is clearly wasteful, and can cause congestion.

The TCP specification includes an algorithm, "given...as illustration," that calculates RTO based on an RTT estimate. Though the algorithm is widely used in TCP implementations, it has known flaws and has proven to be inadequate. The two major problems with this RTO algorithm are:

1. It does not address the issue of sampling RTTs in the event of segment retransmission; and

2. It cannot cope with the high variation in RTT that typifies highly utilized systems.

Phil Karn has developed an algorithm that solves the first of the above problems (Karn & Partridge, 87), and Van Jacobson and Mike Karels have developed an algorithm that solves the second (Jacobson, 88). Experience has shown that using Jacobson's algorithm to compute RTO based on RTT samples, and Karn's algorithm to select RTT samples, can dramatically improve TCP performance. As a result, the Host Requirements RFC mandates the use of these algorithms (Section 4.2.3.1).

The RTT sampling problem that arises when segments are retransmitted is that it is impossible for the sending TCP to determine which transmission is being acknowledged. In other words, there is no way to tell whether the RTT should be measured from the initial segment transmission, or from the retransmission. Use of either approach can seriously corrupt the RTT estimation. On the one hand, if RTT is always measured from the most recent retransmission, then the implementation will severely underestimate RTT exactly when increases in delay cause unnecessary retransmissions. It is even possible for an implementation using this approach to enter a stable state in which it continually underestimates RTT, and as a result transmits every segment several times. On the other hand, if the more conservative approach is used, then RTT will be overestimated for networks that drop packets. Because exponential RTO increases are mandated for successive retransmissions (see section XX, "RTO Backoff"), the effect of RTT overestimation can accumulate. In sufficiently lossy networks (e.g., if about one in three packets is lost, and the losses are independent of each other), RTO will tend to increase to and remain at its upper bound.

The Karn algorithm solves the problem of acknowledgement ambiguity by discarding all RTT measurements that are tainted by a retransmission. The algorithm also stipulates that RTO will not be calculated from RTT estimates until legitimate RTT samples are available. If retransmissions occur, then "backed off" RTO values are retained for use by the next segment (see section XX, "RTO Backoff"). Clamping the RTO values in this fashion prevents TCP from injecting excess traffic as a result of a poor RTT estimate.

Jacobson's algorithm addresses problems in RTO calculation. Originally, RTO was calculated as:

   RTO = beta * SRTT

where beta is a constant between 1.3 and 2, and SRTT is the "Smoothed Round Trip Time." SRTT is obtained by the formula:

   SRTT(n+1) = alpha*SRTT(n) + (1 - alpha)*RTT(n)

where alpha is a constant between 0.8 and 0.9, and RTT(n) is the nth RTT measurement.
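As a point of reference, this classic calculation can be written in a few lines of C. The fragment below is only an illustrative sketch: the constants are example choices from the ranges quoted above, the variable and function names are invented, and floating point is used for readability even though real implementations typically use scaled integer arithmetic.

    /* Sketch of the classic (RFC 793 style) RTO calculation described
     * above, with example constants alpha = 0.875 and beta = 2.
     */
    static double srtt = 0.0;           /* smoothed round trip time, seconds */

    static double classic_rto(double rtt_sample)
    {
        const double alpha = 0.875;     /* within the 0.8 - 0.9 range */
        const double beta  = 2.0;       /* within the 1.3 - 2 range   */

        srtt = alpha * srtt + (1.0 - alpha) * rtt_sample;
        return beta * srtt;
    }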
This approach for calculating an average is known as "exponential smoothing." It is a weighted average: the most recent observations are given the most weight, while the impact of earlier observations decreases as the algorithm is iterated.

The problem with the original RTO calculation is that a fixed multiplier is used to cope with the variance of round trip delay. This approach is inadequate for the Internet; it is well known that RTT varies tremendously. This is not surprising: from queuing theory, we expect that systems under heavy load will display large variance in delay. Increasing a fixed beta sufficiently to accommodate the range of delay variance would result in an RTO that is much too inflated.

In the Jacobson algorithm, RTO is calculated as:

   RTO = SRTT + 2 * MDEV

where MDEV is the average absolute error -- the "mean deviation" -- of SRTT. Jacobson chose mean deviation rather than standard deviation because it is much simpler to calculate. As he points out, however, the mean deviation is itself a rather good estimator for standard deviation. A rough summary of the Jacobson algorithm is that delays as great as roughly two standard deviations above average will be accepted without retransmission. The most important feature of this algorithm is that RTO is dynamically tailored to the sampled RTT variance.

In Jacobson's algorithm, the estimate for MDEV is derived from second-order exponential smoothing, as follows:

   Err = | RTT - SRTT |
   SErr(n+1) = gamma*SErr(n) + (1 - gamma)*Err

where SErr(n) is the nth "smoothed error." Because it is desirable to have RTO adapt quickly when network load and delay increase, the value for gamma should probably be lower than the value used for alpha. Jacobson suggests 0.875.

In TCP implementations, SRTT and SErr are recalculated often. Hence, the calculation should be efficient. In an appendix of "Congestion Avoidance and Control," Jacobson presents an integer arithmetic implementation of his algorithm that requires only six adds, three shifts, and a compare to obtain values for SRTT, SErr, and RTO.

The procedures for calculating SRTT and SErr are recursive, as each new SRTT and SErr is a function of the previously calculated SRTTs or SErrs, respectively. For the first calculation, however, there are no previously calculated values to use. Hence, these algorithms need seeds for the first calculation. The Host Requirements RFC encourages an initial value of 0 for RTT and 3 seconds for RTO. It also recommends that the Jacobson algorithm begin operation with the SYN segments exchanged during the 3-way handshake of a TCP open (Section 4.2.3.1).

RETRANSMISSION BACKOFF

The interval between successive retransmissions of a TCP segment must increase, or "back off," exponentially. An appropriate rate of increase is for the retransmission timeout (RTO) to double for each transmission of the same segment. If an internet's TCP implementations do not employ exponential backoff, then that internet will be vulnerable to congestive collapse. For this reason, the Host Requirements RFC lists exponential backoff as a TCP requirement (Section 4.2.3.1).

TCP retransmissions occur either because datagrams are lost, or because the actual round trip time of a segment's transmission and its acknowledgement exceeds the RTO. In either event, the root cause is, in most cases, congestion. Hence, TCP retransmissions often have the effect of adding load to a system that is already past saturation. Work is in progress on a proof that an exponential RTO increase is a requirement for stable, collapse-free networks.
An informal argument to this effect is presented in "Congestion Avoidance and Control" (SIGCOMM '88), by Van Jacobson.

Karn's algorithm for selecting round trip time (RTT) samples is also required by the Host Requirements RFC (Section 4.2.3.1). In general, the RTO is computed as a function of estimated RTT. In the Karn algorithm, however, "backed off" values of RTO are retained until a segment is acknowledged without being retransmitted. The motivation for this algorithm is similar to the motivation for exponential backoff: because persisting traffic tie-ups can result from adding traffic to a congested system, TCP implementations should use conservative RTO values until they can obtain sample RTTs that are uncorrupted by retransmissions. The Karn algorithm is readily integrated with the RTT estimation procedures described in Section XX.

OUT OF ORDER SEGMENT HANDLING

The TCP specification (RFC 793) indicates that TCP implementations may choose whether or not to support the buffering of out-of-order segments. In this context, the term "out of order" refers specifically to a segment arriving at a receiving TCP which does not contain the data octet or control function corresponding to the next sequence number expected by the receiving TCP, but is otherwise an acceptable segment. In the terminology of RFC 793, such a segment has SEG.SEQ > RCV.NXT.

It is recommended that a TCP implementation endeavor to buffer out of order segments to the extent possible. RFC 793 notes that not buffering these segments has an unfavorable impact on performance; the remainder of this section explains the extent of the impact.

It seems obvious that a sending TCP will generally transmit segments in sequence number order. Therefore, when an out of order segment arrives at a destination TCP, some type of interesting event has occurred somewhere in the internetwork components between the sender and receiver. There seem to be two primary classes of such events. First, the loss of a segment in the network will cause any subsequent segments (those with higher SEG.SEQ) to appear to be "out of order". In the second class of event, characteristics of the intervening network lead to the reordering of segments. These cases will be considered separately.

LOST SEGMENT CASE

Segments may be effectively lost between the sender and receiver in several ways. They may be deliberately dropped by gateways due to congestion, discarded by gateways or hosts due to data corruption resulting in IP or TCP checksum errors, misdelivered due to anomalies in routing, or they may just vanish due to some type of hardware or software failure.

When a segment is lost for any of these reasons, subsequent segments in transit may continue to be delivered successfully. In TCP, the recovery of the lost segment is accomplished by timer-driven retransmission from the sending TCP. Buffering the out of order segments makes it unnecessary for subsequent segments to be retransmitted if they arrived at the destination TCP intact. This strategy fits together with the requirement for the TCP retransmission algorithm to cause only one segment to be retransmitted on expiration of the timer. When the timer associated with the lost segment expires, that segment (or one containing a superset of its contents) will be retransmitted and the retransmission timer will be restarted with a longer duration.
If this retransmitted segment arrives successfully at the destination, the destination TCP will be able to process not only that segment, but also the buffered segments which were received "out of order". The acknowledgement then sent will cover all of these segments, and the sending TCP will not retransmit the other data which was sent previously.

On the other hand, if the "out of order" segments were not buffered at the destination, the data contained in them will need to be retransmitted as a result of the one preceding segment being lost. This retransmission may require several packet transit times, since the slow-start algorithm will be employed to reestablish the rate of flow. (The use of slow-start is necessary here because the sending TCP cannot distinguish this set of events from a congestion-induced packet loss. In fact, the original lost packet may have been lost due to congestion, or due to other types of channel errors for which slow-start is not logically necessary.)

[WWB: Not clear to me here whether there is something clever about VJ code to take advantage of an ACK for a huge amount of data to make the congestion window stuff not slow-start, or if this is a possible enhancement, or if it is not doable. Requires thought.]

NETWORK REORDERING CASE

Reordering of segments within a network or internetwork is a property which may seem to be inherently incompatible with sensible implementations of network interfaces. However, when multiple physical or logical paths are used to carry the segments of a single TCP connection, the possibility of reordering exists. This situation might occur as a transient event due to path changes as a result of load balancing. It can also occur on a continuing basis when multiple physical or logical links are used to carry the segments of a single TCP connection. For example, several implementations of IP over X.25 networks will open multiple virtual circuits in order to circumvent the limited window size supported by a particular network's X.25 implementation. Or, there may be several relatively low-speed links which are equally good paths to the destination, and they may be used in a round-robin fashion to attempt to gain better throughput.

In any of these situations, the possibility exists of reordering (as viewed by the destination) due to varying transit times on different links. The transit time can vary due to internal fluctuations in loading, or it may differ simply because consecutive segments vary in size. For example, if a sending TCP transmits a large segment followed by a small segment and each of these segments travels through a different low-speed link, the small segment may well "arrive" first.

IMPLEMENTATION CONSIDERATIONS

A distinction should be understood between out of order segments which are (at least partially) within the window, and any other out of order segments. The latter are, strictly speaking, "not acceptable" segments according to RFC 793. However, some implementations may wish to retain some such segments anyway. For example, if a receiving TCP ever "shrinks its window" by decreasing the quantity RCV.NXT+RCV.WND, there may be segments received which could be construed as potentially acceptable.

Buffering out of order segments clearly consumes additional memory resources in a TCP implementation. For many implementations, the space required is not large relative to the overall space available. However, there will probably be some cases where buffer space is a scarce resource.
Several implementation strategies may be useful in such cases to try to achieve the benefits of out of order segment buffering without leading to unacceptable resource impacts. It is plausible that a TCP may purge out of order segments according to some pragmatic rule, such as only allowing some quantity X of out of order data to be buffered. In such a case, the segments buffered should be those closer to the left edge of the window. When incoming segments tend to be small, as in the case of certain TELNET sessions, economies in space utilization may be achieved by combining adjacent out-of-order segments into larger segments or buffers. There is a tradeoff between memory and processor utilization in this situation. It is plausible that a TCP might have a space recovery routine which is called when memory resources are found to be running low. This routine might discard out-of-order data as part of its space recovery strategy.

It is probably desirable to transmit an acknowledgement for out of order segments using the same rules as for other segments, whether the out of order segment is buffered or not. Note that this does not mean that the acknowledgement sequence number is increased; it will typically repeat the acknowledgement sequence number of a previous segment. Although a redundant acknowledgement may seem useless, it actually assists in maintaining throughput under some conditions. In particular, if the sending TCP implements Karn's "fast retransmit" algorithm, the redundant acknowledgement would be recognized as indicating a missing segment, and would induce an immediate retransmission of the next segment.

DELAYED ACKNOWLEDGEMENTS

The TCP protocol provides an acknowledgement sequence number field in every segment. When the ACK flag in the TCP header is set to 1, the value in this field (known as SEG.ACK in RFC 793) is significant, and indicates that all data octets or control flags corresponding to incoming sequence numbers through the number specified have been received by the acknowledging TCP. This definition leaves open the question of when acknowledgements are sent. Since the field is present in every packet, it is obviously useful to set the appropriate value in any outgoing data-carrying packet which happens to be created. However, if the flow of data on a TCP connection is irregular or completely unidirectional for a substantial period of time, it is an implementation choice as to when an acknowledgement-only segment is generated.

A wide range of acknowledgement strategies can be imagined. At one extreme, an implementation might transmit a separate acknowledgement for every segment it receives. The opposite extreme would be to send an acknowledgement only when the window has been filled. (Note that delaying acknowledgement any longer than this would create a deadlock.) Any strategy within this range has certain potential benefits and drawbacks from a performance perspective.

Most obviously, the transmission of an ACK-only segment uses a relatively large amount of network resources in relation to the information conveyed. Other drawbacks of immediate acknowledgement for each segment include the possibility that the local user's processing of the incoming data may quickly create a need to generate another outgoing segment to update the offered window or to send some user data in response. In either of these cases, it would be better to combine the acknowledgement, window update and response data into a single segment.
A notable example of this type of connection activity is the support of TELNET for character-asynchronous remote login sessions with remote echoing.

The inherent problem with delayed acknowledgement is that it is not generally possible for the TCP to know for certain what is likely to occur (or how long it will take to occur) when incoming data is turned over to a process. The inevitable compromise is to delay sending the acknowledgement for some period of time, in the hope that some outgoing segment will be created for one reason or another. If one is created, then the acknowledgement is "piggybacked" on that segment. If no such segment is created within some interval, an ACK-only segment is created and transmitted.

A strategy of deferring acknowledgements has a less obvious but potentially significant performance impact: it destabilizes the other TCP's estimated round-trip time for the connection, and thus may cause inappropriate retransmissions or delay appropriate ones. In either case, connection throughput will be reduced. Since excessive retransmissions also contribute to global congestion, it is particularly important to avoid this condition.

It seems intuitively obvious that some intermediate strategy should provide optimal results. It is unclear whether any one strategy is close to optimal for a wide range of operating environments. Factors influencing the optimal strategy include the window size potentially available, the residual network layer error rate experienced by the connection, the minimum RTT, the mean RTT, the variability of the RTT, timing considerations in the local TCP, and the retransmission-related parameters and algorithms of the other TCP. If the segment transit times in both directions are (on average) the same, half of the RTT will have been used up at the first possible moment for sending an ACK. Since the other half of the RTT is presumably required for transit of the ACK, the safe delay period is proportional to the variability-based component in the other TCP's RTO calculation.

The current recommendation is to acknowledge at least every other segment received, and not to delay an acknowledgement by more than 0.5 seconds. This is a pragmatic engineering compromise, not a demonstrably optimal solution. It permits up to 50% of ACK-only segments to be eliminated, thus presumably allowing a suitable implementation to achieve much of the benefit obtainable from a reduction in overhead transmissions.

Acknowledging at least every second segment (or any similar constraint) accomplishes three things:

(1) It adapts the delay interval to the speed of the connection for fast connections (which is to say, for connections which pass more than four segments per second in one direction).

(2) It provides for transmission of redundant ACKs in the case of a dropped segment, allowing the other TCP to effectively use the fast retransmit algorithm if implemented.

(3) It reduces the likelihood of an unnecessary suspension of transmission by the other TCP when it applies the Nagle algorithm.

The 0.5 second limit (or another absolute limit) accomplishes two things:

(1) It clamps the delay for slow connections to avoid creating another source of delay and thus reducing throughput more than necessary.

(2) It sets a bound on the perturbation in RTT variance observed by the other TCP (so that RTO for high-delay connections doesn't grow excessively and impede any legitimate retransmissions).
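The recommendation just described requires very little machinery. The following C sketch is only illustrative: the connection structure, the timer interface, and the helper functions are invented for this example. It sends an acknowledgement immediately for every second data segment received, and otherwise arms a timer so that no acknowledgement is delayed by more than 0.5 seconds.

    /* Illustrative delayed-ACK policy: ACK at least every second segment,
     * and never delay an ACK by more than 0.5 seconds.  All names here
     * are hypothetical.
     */
    #define ACK_DELAY_MS 500

    struct conn {
        int unacked_segments;    /* data segments received but not yet ACKed */
        int ack_timer_armed;
    };

    /* Assumed to exist elsewhere: transmit an ACK (ACK-only or piggybacked
     * on pending data) and arm/cancel the delayed-ACK timer.               */
    extern void send_ack(struct conn *c);
    extern void arm_ack_timer(struct conn *c, int ms);
    extern void cancel_ack_timer(struct conn *c);

    /* Called whenever an in-order data segment has been accepted. */
    void data_segment_received(struct conn *c)
    {
        if (++c->unacked_segments >= 2) {      /* every other segment: ACK now */
            send_ack(c);
            c->unacked_segments = 0;
            if (c->ack_timer_armed) {
                cancel_ack_timer(c);
                c->ack_timer_armed = 0;
            }
        } else if (!c->ack_timer_armed) {      /* otherwise bound the delay */
            arm_ack_timer(c, ACK_DELAY_MS);
            c->ack_timer_armed = 1;
        }
    }

    /* Called if the 0.5-second delayed-ACK timer expires first. */
    void ack_timer_expired(struct conn *c)
    {
        send_ack(c);
        c->unacked_segments = 0;
        c->ack_timer_armed = 0;
    }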