High Performance Wide Area Network                                K. Yao
Internet-Draft                                                   H. Yang
Intended status: Informational                              China Mobile
Expires: 18 April 2025                                   15 October 2024


   Gap analysis of transport protocols of high performance wide area
                                network
                draft-yy-hpwan-transport-gap-analysis-00

Abstract

   This document analyzes the throughput performance of existing
   transport protocols under different implementation modes, including
   kernel space based mode, user space based mode, and offloading-based
   implementations, and concludes that existing technologies are either
   limited by host CPU overhead or by the complexity of offloading, and
   cannot guarantee high throughput approaching the bandwidth rate of
   the network adapter.  Accordingly, this document proposes new
   requirements for the design of HP-WAN transport protocol.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 18 April 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.


Yao & Yang                Expires 18 April 2025                 [Page 1]

Internet-Draft  Web and Internet Transport Working Group    October 2024


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Definition of Terms . . . . . . . . . . . . . . . . . . . . .   3
   3.  Gap Analysis of Existing Transport Solutions for HP-WAN . . .   4
     3.1.  TCP/IP Stack Running in Kernel Space  . . . . . . . . . .   4
     3.2.  TCP/IP Stack Running in User Space  . . . . . . . . . . .   5
     3.3.  Offloading TCP/IP Stack to Network Adapters . . . . . . .   5
       3.3.1.  Transport Offload Engine(TOE) . . . . . . . . . . . .   5
       3.3.2.  RDMA  . . . . . . . . . . . . . . . . . . . . . . . .   6
   4.  Requirements for New HP-WAN mechanisms  . . . . . . . . . . .   7
     4.1.  Support RDMA  . . . . . . . . . . . . . . . . . . . . . .   7
     4.2.  Lightweight Transport Layer . . . . . . . . . . . . . . .   8
       4.2.1.  Congestion Control  . . . . . . . . . . . . . . . . .   8
       4.2.2.  Reliability . . . . . . . . . . . . . . . . . . . . .   8
       4.2.3.  Multi-path transmission . . . . . . . . . . . . . . .   8
       4.2.4.  Other Requirements  . . . . . . . . . . . . . . . . .   9
     4.3.  Application Developer Friendly Interfaces . . . . . . . .   9
   5.  Security Considerations . . . . . . . . . . . . . . . . . . .   9
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   9
   7.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .   9
   8.  References  . . . . . . . . . . . . . . . . . . . . . . . . .   9
     8.1.  Normative References  . . . . . . . . . . . . . . . . . .   9
     8.2.  Informative References  . . . . . . . . . . . . . . . . .  10
   Appendix A.  Appendix A . . . . . . . . . . . . . . . . . . . . .  10
     A.1.  TCP Performance Test with TOE Enabled . . . . . . . . . .  10
     A.2.  QUIC Performance Test with TOE Enabled  . . . . . . . . .  11
     A.3.  iWARP Performance(Referenced from SC07 Paper) . . . . . .  12
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  13

1.  Introduction

   HP-WAN requires meeting the needs of massive data transmission over
   long-distant, lossy, and shared wide-area network infrastructure,
   with data transmission volumes typically in the terabytes(TBs) to
   petabytes(PBs) range, and throughput is the key performance
   indicator.  The design of transport protocols is critical, including
   rate control, congestion control, and multi-stream processing.
   Transport protocols such as TCP and QUIC, which run in the kernel


Yao & Yang                Expires 18 April 2025                 [Page 2]

Internet-Draft  Web and Internet Transport Working Group    October 2024


   space, can improve end-to-end data transmission throughput to some
   extent through the optimizations of congestion control and multi-path
   transmission algorithms.  But they inevitably introduce excessive CPU
   overhead, resulting in that actual throughput cannot approach the
   bandwidth of the endpoint network adapters, and increasing more
   operational costs.

   Currently, super-fast Ethernet has become the industry trend, with
   commercial products available at 400G and accelerating towards 800G
   and higher T-bit level bandwidth evolution, while the performance
   growth rate of CPUs has become slower, the gap between the two growth
   rates is becoming more obvious, so the utilization rate of the
   endpoint CPU for data transmission is becoming increasingly important
   for improving endpoint throughput.

   This document analyzes the throughput performance of existing
   transport protocols under different implementation modes, including
   kernel space based mode, user space based mode, and offloading-based
   implementations, and concludes that existing technologies are either
   limited by host CPU overhead or by the complexity of offloading, and
   cannot guarantee high throughput approaching the bandwidth rate of
   the network adapter.  Accordingly, this document proposes new
   requirements for the design of HP-WAN transport protocol.

2.  Definition of Terms

   This document makes use of the following terms:

   TOE:  Transport Offload Engine.  A series of techniques to offload
     part of TCP protocol stack or TCP processing to network adapters.

   RDMA:  Remote Direct Memory Access.  Bypass CPU to access the memory
     of the other network endpoint.

   RoCEv2:  RDMA over Converged Ethernet version 2.  The second version
     of applying RDMA transport layer which originates from Inifiniband
     over TCP/IP stack.

   iWARP:  internet Wide Area RDMA Protocol.  A protocol suite that
     contains several layers and protocols to realize RDMA functionality
     over TCP/IP stack.


Yao & Yang                Expires 18 April 2025                 [Page 3]

Internet-Draft  Web and Internet Transport Working Group    October 2024


   Even though this document is not a protocol specification, it makes
   use of upper case key words to define requirements unambiguously.
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Gap Analysis of Existing Transport Solutions for HP-WAN

   There are three main ways to implement TCP/IP protocol stack: running
   in user space, running in kernel space, and offloading-based
   solutions.  Offloading-based methods can be further divided into
   partial offloading to assist kernels and RDMA.  Both fully running in
   kernel space and running in user space lead to very high CPU overhead
   and suffer from throughput performance bottlenecks.  Partial
   offloading methods, such as TOE, can reduce CPU overhead to some
   extent, but still face throughput performance bottlenecks.  Existing
   RDMA implementations either require support from intermediate network
   node or have high offloading complexity, and still cannot guarantee
   satisfactory performance requirements.

3.1.  TCP/IP Stack Running in Kernel Space

   TCP, QUIC, and other mainstream transport protocols primarily run in
   the kernel space of operating systems such as Linux.  After long-term
   community version iterations and maintenance, these protocols have
   excellent reliability and scalability, such that they can meet the
   performance requirements of most Internet applications, such as
   multimedia and web applications.  However, the core requirement of
   HP-WAN services is the extremely high data throughput.  Although the
   throughput performance of these kernel-based protocols can also be
   improved, with the maturity of Bottleneck Bandwidth and Round-trip
   propagation time(BBR) [I-D.cardwell-ccwg-bbr], the CPU overhead is
   still very high.

   Reasons why these kernel-based transport protocols consume high CPU
   resources are that:

   * Frequent copy between kernel space and user space, especially on
   the receiver side, leads to excessive CPU resource consumption.

   * Interruption operations take up a lot of CPU virtual cores and
   there is a contention between the interruption and the processing of
   TCP/IP protocol stack.


Yao & Yang                Expires 18 April 2025                 [Page 4]

Internet-Draft  Web and Internet Transport Working Group    October 2024


   * Multi-stream transmission can improve the total throughput to a
   certain extent, but in general, a stream needs to bind a CPU core,
   and multiple streams will compete for CPU resources, resulting in CPU
   load imbalance and affecting the actual throughput.

   * In modern non-uniform memory access(NUMA) architecture, there is a
   lot of communication overhead between CPU cores.

   * Control messages take up a lot of CPU resources.

   Implementing complete TCP/IP transport protocol stack in the kernel
   space of operating systems inevitably introduces the above problems,
   even though they are not in the scope of the IETF definition, but
   will affect the choice of technologies and protocol design.  This is
   especially true when throughput performance metrics are critical.
   And it should be considered in the transport protocol design space.

3.2.  TCP/IP Stack Running in User Space

   The advantage of running complete TCP/IP protocol stack in user space
   is that it can effectively reduce the memory copy overhead between
   user space and kernel space, and even achieve zero-copy, which can
   reduce the processing latency, but it still introduces CPU overhead.
   The main reasons are the inability of threads and processes to share
   Socket and the overhead of multi-threaded locks.  For example, when
   the main thread listens for new connection establishment in the
   Socket, the system needs to start a child process to handle the
   request, and the child process needs to access the newly connected
   Socket.  But at the same time, the parent process needs to listen to
   the Socket for new connections.  Socket contention between parent and
   child processes and the lock maintenance in multi-threaded
   applications are both detrimental to throughput performance.

3.3.  Offloading TCP/IP Stack to Network Adapters

3.3.1.  Transport Offload Engine(TOE)

   TOE is a partial offloading technology, which offloads part of the
   TCP/IP stack to the hardware of the network adapter, but some of the
   protocol stack still runs in the kernel space of the operating
   system.  This technique can reduce the CPU overhead to some extent.
   Many mature TOE technologies have been accepted by the industry, such
   as generic segmentation offload(GSO), large receiver offload(LRO),
   receiver side scaling(RSS), and checksum offloading.  However, TOE
   technology has a performance bottleneck.  For example, Appendix A has
   listed the throughput performance measurement of TCP and QUIC
   protocols under different congestion control algorithms like BBR and
   CUBIC[RFC9438], with TOE enabled.


Yao & Yang                Expires 18 April 2025                 [Page 5]

Internet-Draft  Web and Internet Transport Working Group    October 2024


3.3.2.  RDMA

   RDMA is an important technology to realize CPU by-pass, and there are
   standardized technologies applied in in data center networks and wide
   area networks, such as RoCEv2 and iWARP, etc., which has a history of
   about 20 years.  But the current RDMA technology has some
   limitations.  RoCEv2 is oriented to data center networks and requires
   lossless features from the network.  RoCEv2 is not defined by IETF.
   iWARP is designed for data center networks, local area networks and
   wide area networks.  Due to the underlying TCP protocol, iWARP is
   reliable without intermediate network support, which is aligned with
   design goal of HP-WAN transport protocol, but iWARP still has
   performance limitations.

3.3.2.1.  RoCEv2

   RoCEv2 is defined by Infiniband Trade Association(IBTA) which
   compatibly adapts RDMA transport layer originally from Infiniband to
   TCP/IP protocol stack, and it is more in line with the development
   trend of Ethernet ecosystem.  The condition of RoCEv2 to guarantee
   high throughput data transmission is that the network needs to
   provide congestion notification mechanisms such as explicit
   congestion notification(ECN) and data center bridging(DCB), and the
   ability of lossless transmission, such as priority-based flow
   control(PFC) mechanism.  In short-distant data center networks, it is
   feasible to guarantee such capabilities, but in the Internet,
   providing such capabilities, especially for lossless networks, is
   extremely expensive.  Therefore, native RoCEv2 is not suitable for
   high throughput data transmission in wide-area lossy environments.

3.3.2.2.  iWARP

   iWARP protocol suite was defined in IETF, and it was based on TCP/IP
   stack so that it can work reliably itself.  It consists of three main
   layers and protocols to achieve RDMA functionality over data center
   network, local area network, and wide area network.  The first is the
   RDMA semantic layer where RDMA read, write and send semantics are
   defined.  RDMAP RFC 5040 [RFC5040] is in this layer.  The second is
   the RDMA core functionality layer where message segmentation and
   reassembly, buffer models, and direct data placement in-order message
   delivery are defined.  RDDP RFC 5041 [RFC5041] is in this layer.  The
   third is the boundary marking layer, where data framing and
   integrity, and Framed Protocol Data Unit (FPDU) alignment are
   defined.  MPA RFC 5044 [RFC5044]is in this layer.

   Three offloading modes of iWARP are defined in [SC07], namely host-
   based, host-offloaded, and host-assisted.  Host-based only offloads
   TCP stack to the network adapter.  Host-offloaded mode offloads DDP,


Yao & Yang                Expires 18 April 2025                 [Page 6]

Internet-Draft  Web and Internet Transport Working Group    October 2024


   MPA and TCP stack to the network adapter, and host-assisted retains
   RDMAP and Markers function of MPA layer in host.  Only DDP, CRC and
   TCP stack are offloaded.  There is an obvious gap between the
   performance of the three.  More details on the test performance are
   attached in the appendix.  The conclusion is that three
   implementations can not guarantee high throughput, and the bandwidth
   of the network adapter can not be fully utilized.  Perhaps this is
   part of the reasons why iWARP has less industry support and
   implementations.

    +----------------+ +----------------+ +------------------+
    | +-----++----+  | |    +-----+     | | +-----++-------+ |
    | |RDMAP||RDDP|  | |    |RDMAP|     | | |RDMAP||Markers| |
    | +-----++----+  | |    +-----+     | | +-----++-------+ |
    | +---++-------+ | +-------+--------+ +--------+---------+
    | |CRC||Markers| |         |                   |
    | +---++-------+ |         |                   |        HOST
    +-------+--------+         |                   |
   ---------+------------------+-------------------+------------
            |                  |                   |
    +-------v--------+ +-------v--------+   +------v-------+
    |    +-------+   | |+----++-------+ |   | +----+ +---+ |
    |    |TCP/IP |   | ||RDDP||Markers| |   | |RDDP| |CRC| |
    |    +-------+   | |+----++-------+ |   | +----+ +---+ | NIC
    +----------------+ |+---+ +-------+ |   |   +------+   |
                       ||CRC| |TCP/IP | |   |   |TCP/IP|   |
                       |+---+ +-------+ |   |   +------+   |
                       +----------------+   +--------------+

                Figure 1: Three Offloading methods of iWARP

4.  Requirements for New HP-WAN mechanisms

   Based on the analysis above, exsiting tranport solutions introduce
   high CPU cost or high offloading complexity, which result in
   performance bottleneck.  These solutions are not evolvable to satisfy
   HP-WAN high-throughput performance requirements in T-bit level
   Ethernet era.  Therefore, the following new requirements are
   proposed.

4.1.  Support RDMA

   For applications with ultra-high throughput transmission
   requirements, RDMA MUST be supported to reduce CPU overhead.  It is
   RECOMMENDED that RDMA establish a connection with one handshake and
   breaks the connection with one handshake at the semantic level.
   Write operation MUST be supported, but read operation SHOULD be
   optional.


Yao & Yang                Expires 18 April 2025                 [Page 7]

Internet-Draft  Web and Internet Transport Working Group    October 2024


4.2.  Lightweight Transport Layer

   The iWARP protocol suite can fully offload TCP/IP transport layer,
   but due to the limitations of space and time complexity, performance
   still has a bottleneck, especially on throughput.  Therefore, RDMA
   cannot be based on a complete TCP transport protocol implementation,
   and a lightweight transport protocol must be designed.  On the other
   hand, the transport protocol should still have some reliability and
   congestion management functions to not depend on any auxiliary
   capabilities provided by the network layer.

4.2.1.  Congestion Control

   In terms of congestion control, the mechanism for congestion
   detection is a key factor in the algorithm.  Delay, bandwidth, and
   packet loss are all important signals for determining congestion, but
   they have their own advantages and disadvantages in terms of false
   positive rate, fairness, and algorithm complexity.  Therefore, when
   congestion control mechanisms, there are some requirements that can
   be considered:

   * TCP's slow start and congestion avoidance mechanism MUST be
   improved to enhance the utilization of bandwidth.

   * It is RECOMMNED to transform the congestion window mechanism into a
   time-interval-based sending rate control mechanism which can be
   easier to implement on network adapter.

4.2.2.  Reliability

   TCP's rate control relies on the sliding window mechanism, which
   ensures better reliability but also limits the maximum sending
   capability of the sender side to some extent.  TCP's ability to
   handle out-of-order reception SHOULD be retained, and data packets
   are stored as soon as they arrive, reducing the occupancy of
   resources.  In terms of handling lost packets, a precise re-
   transmission mechanism MUST be designed, with only lost packet re-
   transmission notification and supporting cumulative acknowledgement,
   reducing the impact of packet loss and transmission distance on
   throughput performance.

4.2.3.  Multi-path transmission

   Multi-path transmission SHOULD be supported.  In a single connection,
   multiple streams are maintained in parallel, and the streams are
   independent.  Parameters of congestion control can be set separately.


Yao & Yang                Expires 18 April 2025                 [Page 8]

Internet-Draft  Web and Internet Transport Working Group    October 2024


4.2.4.  Other Requirements

   In addition, security MUST be considered in most cases.  Packet
   SHOULD be designed with high load ratio to fully consider the
   complexity of offloading.  At the same time, the above components
   SHOULD be designed modularly, that is, congestion control, lost
   packet re-transmission, and data encryption can be turned on and off
   on demand.

4.3.  Application Developer Friendly Interfaces

   It MUST not change the programming interfaces of mainstream
   applications such as Verbs, Libfabric, Socket, etc.  At present, RDMA
   mainly supports Verbs and Libfabric interfaces, and big data
   transmission applications need to develop adaptations for these
   interfaces.  If there are distributed applications requiring huge
   amount of data movement based on Socket interface in the future, the
   development of network adapter driver MUST be oriented to Socket
   interface.  However, RDMA and Socket abstraction do not match, API
   compatibility is poor, and it is necessary to support standardized
   programming interface.

5.  Security Considerations

   TBD.

6.  IANA Considerations

   TBD.

7.  Acknowledgements

   Authors would like to thank other team members from China Mobile for
   their contributions.  Guangyu Zhao, Shiping Xu, Zongpeng Du, Zhiqiang
   Li.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.


Yao & Yang                Expires 18 April 2025                 [Page 9]

Internet-Draft  Web and Internet Transport Working Group    October 2024


   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007, <https://www.rfc-editor.org/info/rfc5040>.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
              Data Placement over Reliable Transports", RFC 5041,
              DOI 10.17487/RFC5041, October 2007,
              <https://www.rfc-editor.org/info/rfc5041>.

   [RFC5044]  Culley, P., Elzur, U., Recio, R., Bailey, S., and J.
              Carrier, "Marker PDU Aligned Framing for TCP
              Specification", RFC 5044, DOI 10.17487/RFC5044, October
              2007, <https://www.rfc-editor.org/info/rfc5044>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC9438]  Xu, L., Ha, S., Rhee, I., Goel, V., and L. Eggert, Ed.,
              "CUBIC for Fast and Long-Distance Networks", RFC 9438,
              DOI 10.17487/RFC9438, August 2023,
              <https://www.rfc-editor.org/info/rfc9438>.

8.2.  Informative References

   [I-D.cardwell-ccwg-bbr]
              Cardwell, N., Swett, I., and J. Beshay, "BBR Congestion
              Control", Work in Progress, Internet-Draft, draft-
              cardwell-ccwg-bbr-00, 22 July 2024,
              <https://datatracker.ietf.org/doc/html/draft-cardwell-
              ccwg-bbr-00>.

   [SC07]     Balaji, P., "Analyzing the impact of supporting out-of-
              order communication on in-order performance with iWARP",
              2007.

Appendix A.  Appendix A

A.1.  TCP Performance Test with TOE Enabled

   TCP was tested under the following environments: CPU type: Intel Xeon
   ® Gold 6248R processor, number of cores: 4x24=96, clock frequency:
   3.0GHz, network adapter: 100Gbps Nvidia CX-5, PCIE 3.0x16, and
   operating system: CentOS 8.


Yao & Yang                Expires 18 April 2025                [Page 10]

Internet-Draft  Web and Internet Transport Working Group    October 2024


   The experiment is based on the laboratory environment, and the
   single-stream and multi-stream throughput performance are tested
   under different congestion control algorithms.  BBRv1 and Cubic are
   tested under the condition of 0.1% and 1% packet loss.  It can be
   seen that the throughput performance of the BBR algorithm can reach
   more than 10Gbps in the case of a single stream, and the throughput
   reaches a peak of more than 80Gbps in the case of 25 streams sharing
   the network.  The throughput performance of BBR algorithm is much
   higher than CUBIC, because CUBIC is very sensitive to packet loss.
   The performance of BBR still can not reach the ceiling of bandwidth.

   The reasons are that on the one hand, the receiver needs to maintain
   more overhead introduced by packet loss lookup and packet loss
   recovery.  On the other hand, TOE's optimization effect becomes poor.
   These effects lead to CPU utilization rate approaches the bottleneck.

   +---------+------------+----------+------------+----------+
   |         | TCP+BBRv1  |TCP+BBRv1 | TCP+CUBIC  |TCP+CUBIC |
   |         |0.1%Pkt loss|1%Pkt loss|0.1%Pkt loss|1%Pkt loss|
   +---------+------------+----------+------------+----------+
   | Single  |   14Gbps   |  10Gbps  |   8.6Mbps  |   Null   |
   | Stream  |            |          |            |          |
   +---------+------------+----------+------------+----------+
   |    3    |   41Gbps   | 24.5Gbps |   28Mbps   |   Null   |
   | Streams |            |          |            |          |
   +---------+------------+----------+------------+----------+
   |   10    |   70Gbps   |  61Gbps  |   91Mbps   |   Null   |
   | Streams |            |          |            |          |
   +---------+------------+----------+------------+----------+
   |   25    |   84Gbps   | 84.7Gbps |    Null    |   Null   |
   | Streams |            |          |            |          |
   +---------+------------+----------+------------+----------+

           Figure 2: TCP throughput performance,RTT=70ms,MTU=1500

A.2.  QUIC Performance Test with TOE Enabled

   QUIC performance test is under the following environments: CPU type:
   Intel Xeon Gold 6248 @2.5GHz, 20 cores each.  Network adapter: Nvidia
   CX-5, dual-port 100G, PCIE: 3.0x16.  This test is over China Mobile
   backbone network(CMNET) from Harbin to Guiyang, over 3000 kilometers
   with average RTT around 65ms.  The average packet loss is around
   0.01% to 0.1%.

   BBRv1 is selected as the congestion control algorithm, and the test
   results are obtained under the condition of CPU utilization
   approaching saturation.  In the test case of 40 CPU cores, when the
   number of streams exceeds 40, the throughput is maintained at about


Yao & Yang                Expires 18 April 2025                [Page 11]

Internet-Draft  Web and Internet Transport Working Group    October 2024


   50Gbps.  In the test case of 80 CPU cores, when the number of streams
   exceeds 80, the throughput is maintained at about 63Gbps.  Under the
   two test conditions, there is a big gap between the theoretical limit
   of 100Gbps bandwidth of network adapter and the real data
   transmission throughput.  The results confirm the CPU overhead plays
   a pivotal role in endpoint data transmission throughput.

   +---------+------------+------------+
   |         | QUIC+BBRv1 | QUIC+BBRv1 |
   |         |0.1%Pkt loss|0.1%Pkt loss|
   |         |  40 cores  |  80 cores  |
   +---------+------------+------------+
   |   40    |   47.2Gbps | 52.8Gbps   |
   | Streams |            |            |
   +---------+------------+------------+
   |   60    |   42.4Gbps | 57.2Gbps   |
   | Streams |            |            |
   +---------+------------+------------+
   |   80    |   51.2Gbps | 62.4Gbps   |
   | Streams |            |            |
   +---------+------------+------------+
   |   100   |   NULL     | 63.2Gbps   |
   | Streams |            |            |
   +---------+------------+------------+

         Figure 3: QUIC Throughput Performance, RTT=65ms, MTU=1500

A.3.  iWARP Performance(Referenced from SC07 Paper)

   Test environment: Chelsio 10GE NIC, 2 Intel Xeon 3.0GHz processors,
   Redhat OS.

   The figure shows that the three offloading methods have very
   different performance in terms of actual network adapter throughput
   and host CPU utilization.  In the processing of 128B small messages,
   the CPU utilization of the three methods is relatively low, but the
   throughput is also very low.  When processing larger message of
   256KB, it can be seen that although the CPU utilization rate of the
   host-offloaded method is very low, which can be maintained at 10%,
   its throughput is only 3.5GB, about 1/3 of the actual network adapter
   bandwidth, and its throughput performance is not good.  The best
   throughput performance is achieved in host-assisted mode, which is
   partially offloaded.  But the best throughput is only about 60% of
   the network adapter bandwidth, and its CPU utilization has reached
   80%.  The host-based model performs worse in terms of both bandwidth
   and CPU utilization.


Yao & Yang                Expires 18 April 2025                [Page 12]

Internet-Draft  Web and Internet Transport Working Group    October 2024


   The above results show that there are performance bottlenecks in all
   the three implementations of iWARP, which cannot maintain high
   throughput close to the physical bandwidth of network adapters with
   low CPU consumption.  One of the main influencing factors is the
   Markers function in MPA layer.  Markers have an important role in
   marking the functional boundaries of TCP and RDDP protocols.
   Offloading Markers to network adapters requires too much state
   maintenance resources, which affects the actual throughput.  If the
   Markers function is implemented inside the host, the CPU usage will
   be too high.  Therefore, although TCP-based iWARP has good
   reliability and scalability, its performance is still limited and
   needs to be further improved.

   +---------+----------+----------+----------+
   |         | MSG size | MSG size | MSG size |
   |         |  128B    |   1KB    |   256KB  |
   |         +-----+----+-----+----+-----+----+
   |         | BW  |CPU | BW  |CPU | BW  |CPU |
   |         |     |Util|     |Util|     |Util|
   +---------+-----+----+-----+----+-----+----+
   |  host-  |100MB|12% |800MB|40% |1.8GB|75% |
   |  based  |     |    |     |    |     |    |
   +---------+-----+----+-----+----+-----+----+
   |  host-  |1GB  |10% |3.5GB|10% |3.5GB|10% |
   |offloaded|     |    |     |    |     |    |
   +---------+-----+----+-----+----+-----+----+
   |  host-  |500MB|18% |3.5GB|50% |5.8GB|80% |
   | assisted|     |    |     |    |     |    |
   +---------+-----+----+-----+----+-----+----+

    Figure 4: Performance Comparision of Three Offloading modes of iWARP

Authors' Addresses

   Kehan Yao
   China Mobile
   Beijing
   100053
   China
   Email: yaokehan@chinamobile.com


   Hongwei Yang
   China Mobile
   Beijing
   100053
   China
   Email: yanghongwei@chinamobile.com


Yao & Yang                Expires 18 April 2025                [Page 13]