Internet-Draft TP for INC January 2024
Song, et al. Expires 27 July 2024
Workgroup:
Network Working Group
Internet-Draft:
draft-song-inc-transport-protocol-req-01
Published:
Intended Status:
Informational
Expires:
Authors:
H. Song
Futurewei Technologies
W. Wu
Peking University
D. Kutscher
The Hong Kong University of Science and Technology (Guangzhou)

The Requirements of a Unified Transport Protocol for In-Network Computing in Support of RPC-based Applications

Abstract

In-network computing breaks the end-to-end principle and introduces new challenges to the transport layer functionalities. This draft provides the background of a suite of RPC-based applications which can take advantage of INC support, surveys the existing transport protocols to show that they are insufficient or unsuitable in this context, and lays out the requirements for developing a general transport protocol tailored to such applications. The purpose of this draft is to help understand the problem domain and inspire the design and development of a unified INC transport protocol.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 27 July 2024.

1. Motivation

In a broader sense, COmputing-In-Network (COIN) covers many distinct types of applications which rely on networks to do more than packet forwarding (e.g., active networking, edge computing, and service function chaining). However, the emerging term In-Network Computing (INC) [inc] in particular refers to a narrower scope which applies on-path programmable networking devices (e.g., switches and routers between clients and servers) as an accelerator or function offloader to boost throughput, reduce server load, or improve latency, typically in a well-controlled data center network environment.

Some INC implementations evolved from programmable data plane systems and align with the trend of network programmability at large. In recent years, INC has been shown to support many promising applications (e.g., caching, aggregation, and agreement). For example, in distributed machine learning (DML), training nodes produce data (gradients) that needs to be aggregated or reduced -- and the result could be distributed to one or multiple consumers. As another example, the NetClone system [netclone] uses an in-network forwarder to replicate RPC invocation messages and to perform more informed forwarding based on observed latencies for accelerating RPC communication.

While it is possible to achieve this kind of operation purely with end-to-end communication between worker nodes, performance can be dramatically improved by offloading both the operation processing and the data dissemination to nodes in the network. These in-network processors are often conceived as semi-transparent performance enhancing on-path elements, i.e., they are not the actual endpoints in transport protocol sessions and would intercept packets with application data and potentially generate new data that they would have to transmit.

The intended INC behavior thus cannot be achieved with existing end-to-end transport protocols such as TCP and QUIC. Conventionally, the network devices are only supposed to process the packets up to the network layer and leave the upper layers (i.e., transport layer and application layer) intact for the end hosts to process; however, INC requires the network devices to participate in the application logic, so they inevitably need to process the related packets up to the application layer, as shown in Figure 1.


                  /-------------------\
                 /     INC devices     \
+-----------+   /     +-----------+     \    +-----------+
|application|   |     |application|     |    |application|
+-----------+   |     +-----------+     |    +-----------+
| transport |   |     | transport |     |    | transport |
+-----------+   |     +-----------+     |    +-----------+
|  network  |<--+---->|  network  |<----+--->|  network  |
+-----------+   \     +-----------+     /    +-----------+
   client        \---------------------/         server
                         network
Figure 1: Network Protocol Stack in INC

In the context of the INC systems we refer to here, the computing functions need to be performed in the data plane fast path. There may be other use cases where a network device needs to direct the application packets to the slow path (e.g., a local CPU or a remote server) for processing, which we do not consider here.

Programmable data plane devices use different programming languages (e.g., P4 and HDL) and have different chip architectures (e.g., RMT pipeline, RTC, and FPGA). These devices are optimized for simple packet processing and forwarding with limited hardware resources. Specifically, it is difficult for the devices to support complex stateful operations and mathematical calculations beyond integer addition and shift. Unsurprisingly, the in-network computing functions for the supported applications are all relatively simple (e.g., resorting to lookup tables or counters). However, the programmable switch chip technology is also progressing fast, with better stateful operation support and computing capabilities. It is conceivable that future programmable switches could undertake more computing tasks, albeit still in a facilitating role.

To correctly handle the computing tasks, however, a reliable transport layer must be present. The transport layer provides common services such as connection maintenance, reliability, flow control, and multiplexing. The existing INC applications either make oversimplified assumptions to eschew this problem (e.g., assume the use of UDP as the transport layer protocol or ignore the transport layer altogether) or provide an ad hoc solution dedicated to a particular application, which entangles the transport and application functions (e.g., ATP). A general protocol for the transport layer is needed for INC to take care of the common transport issues. It can free the application developers from worrying about the transport issues and help them focus on the application logic itself.

This draft provides the background of a suite of RPC-based applications which can take advantage of INC support, surveys the existing transport protocols to show that they are insufficient or unsuitable in this context, and lays out the requirements for developing a general transport protocol tailored to such applications. The purpose of this draft is to help understand the problem domain and inspire the design and development of a unified INC transport protocol.

2. INC Application RPCs

The INC applications concerned in this draft all follow the communication paradigm of idempotent Remote Procedure Call (RPC): A client sends a message with arguments to a server and gets a response back which reflects the computation result based on the arguments. On the one hand, this paradigm differs from the byte-stream transfer for which TCP is mainly used; on the other hand, it requires a reliable datagram service beyond what UDP can support.

We can classify these INC applications into three service models:

Synchronous Collaboration (SC):
from a set of clients, each sends a piece of data to a server roughly at the same time. The result can be computed and sent back to the clients when all the data pieces are received. A notable example is AllReduce (one operation in the class of Collective Communication [I-D.yao-tsvwg-cco-problem-statement-and-usecases]). Quite often there is one result that needs to be transmitted back to all clients, i.e., a multi-destination delivery service could be applied.
Asynchronous Collaboration (AC):
from a set of clients, each sends multiple data items to a server. The result can be computed when all the data items are received. An example of such applications is MapReduce.
Individual Request (IR):
a client sends individual requests to a server and gets a response for each request. An example of such applications is NetCache [netcache].
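
As an illustration of the Synchronous Collaboration model, the following is a minimal sketch of how an aggregation point might combine one integer vector per client and release the result once all clients have contributed. The class name SyncAggregator, its fields, and the client identifiers are hypothetical and are not defined by this draft. The sketch uses only integer addition and ignores retransmitted requests, which reflects the idempotent RPC paradigm described above.

```python
from dataclasses import dataclass, field


@dataclass
class SyncAggregator:
    """Hypothetical in-network aggregation slot for one SC round."""
    num_clients: int
    total: list = field(default_factory=list)
    seen: set = field(default_factory=set)

    def on_request(self, client_id, values):
        """Fold one client's integer vector into the running sum.

        Returns the aggregated vector once every client has contributed,
        otherwise None. A retransmitted request from a client that was
        already counted is not added again, so the call is idempotent.
        """
        if client_id not in self.seen:
            if not self.total:
                self.total = [0] * len(values)
            # Integer addition only, as simple data planes support.
            self.total = [t + v for t, v in zip(self.total, values)]
            self.seen.add(client_id)
        if len(self.seen) == self.num_clients:
            return list(self.total)
        return None


agg = SyncAggregator(num_clients=3)
first = agg.on_request("c1", [1, 2, 3])
dup = agg.on_request("c1", [1, 2, 3])       # retransmission, not double counted
second = agg.on_request("c2", [10, 20, 30])
result = agg.on_request("c3", [100, 200, 300])
print(result)
```

In this sketch the single result would then be delivered to all three clients, which is where a multi-destination delivery service could apply.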

From a different perspective, we can observe that there are three basic communication modes depending on the applications, as shown in Figure 2. From a client perspective, the INC support is transparent, i.e., the client sends a message, such as an RPC, and if there is an on-path INC device, that device could execute the operation as an optimization. If there is no such on-path INC device, the message would be transmitted to a specified endpoint. Depending on the actual network configuration, capabilities, and load situation, one of the following modes can be selected:

Device Only Mode (DO):
the INC network devices alone can completely finish a computing task. Therefore a client can choose to send a task to the INC network devices instead of a server and the final result is directly returned to the client from the INC network devices.
Device+Server Mode (DS):
the INC network devices can only partially finish a computing task and the intermediate result still needs to be sent to a server to finalize. The final result must be returned to the client from a server.
Hybrid Mode (HM):
the INC network devices may or may not finish a computing task, therefore the final result may be returned by the INC network devices or by a server.

Each mode has its dominant benefits: Using DO mainly aims to reduce the latency and using DS mainly aims to reduce the traffic bandwidth and server load. Using HM may achieve both benefits, albeit with more implementation complexity.


                   +-------+
+------+         +-------+ |        +------+
|      |         |network| |        |      |
|client|<------->|devices| |        |server|
|      |         |       |-+        |      |
+--^---+         +-------+          +---^--+
   |                                    |
   +------------------------------------+
               Device Only Mode (DO)

                   +-------+
+------+         +-------+ |        +------+
|      |         |network| |        |      |
|client+-------->|devices+-+------->|server|
|      |         |       |-+        |      |
+--^---+         +-------+          +--+---+
   |                                   |
   +-----------------------------------+
              Device+Server Mode (DS)

                   +-------+
+------+         +-------+ |        +------+
|      |         |network| |        |      |
|client+-------->|devices+.........>|server|
|      |<--------|       |-+        |      |
+--^---+         +-------+          +--.---+
   :                                   :
   .....................................
              Hybrid Mode (HM)

Figure 2: In Network Computing Working Modes
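
The Hybrid Mode can be sketched for an Individual Request application in the style of a key-value cache. This is a simplified illustration under stated assumptions: the class HybridDevice, its cache contents, and the callable standing in for the server are hypothetical, and no real packet handling is modeled.

```python
class HybridDevice:
    """Hypothetical on-path device holding a small key-value cache."""

    def __init__(self, cache, server):
        self.cache = dict(cache)   # entries the device can answer itself
        self.server = server       # callable standing in for the real server

    def handle(self, key):
        """Return (responder, value) for one Individual Request."""
        if key in self.cache:
            # The device finishes the task and replies directly.
            return ("device", self.cache[key])
        # The device cannot finish the task: fall back to the server.
        return ("server", self.server(key))


backing_store = {"a": 1, "b": 2, "c": 3}
device = HybridDevice(cache={"a": 1}, server=backing_store.get)

hit = device.handle("a")
miss = device.handle("b")
print(hit, miss)
```

A request that hits the device cache follows the DO-style path and avoids the server entirely, while a miss still obtains its result from the server.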

Figure 3 provides the dominant combinations of the service model and communication model. AC may require more resources than a network device can provide, so it is less used with the DO mode; IR usually aims to optimize the response latency, so the DS mode is less helpful, yet HM may provide a fallback mechanism for unsatisfied requests.

+-----------------------+-----+-----+-----+
|                       | DO  | DS  | HM  |
+-----------------------+-----+-----+-----+
|Sync Collaboration(SC) |  x  |  x  |  x  |
+-----------------------+-----+-----+-----+
|Async Collaboration(AC)|     |  x  |     |
+-----------------------+-----+-----+-----+
|Individual Request(IR) |  x  |     |  x  |
+-----------------------+-----+-----+-----+
Figure 3: Service Model and Communication Model
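
For readers who want to use Figure 3 programmatically, the matrix can be transcribed as a small lookup table. The names DOMINANT_MODES, is_dominant, and models_for are illustrative only; the entries are copied from the figure.

```python
# Dominant combinations transcribed from Figure 3.
DOMINANT_MODES = {
    "SC": {"DO", "DS", "HM"},
    "AC": {"DS"},
    "IR": {"DO", "HM"},
}


def is_dominant(service_model, mode):
    """Return True if Figure 3 marks this service model / mode pair."""
    return mode in DOMINANT_MODES.get(service_model, set())


def models_for(mode):
    """List the service models that Figure 3 pairs with a given mode."""
    return sorted(m for m, modes in DOMINANT_MODES.items() if mode in modes)


print(is_dominant("AC", "DO"))
print(models_for("DS"))
```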

3. Existing Transport Protocols

We argue that the existing transport protocols are not suitable for INC.

TCP:
As the most widely used transport protocol, TCP (as well as its variants such as DCTCP and MPTCP) is ruled out because of its end-to-end streaming semantics. Any mutation to the TCP packet payloads is considered a break in the stream, but the INC applications which require network device collaboration do need to modify the packet payload. Also, any dropped packet in a TCP stream sensed by the receiver must be retransmitted; this prohibits the INC applications which can terminate a packet and return the computing result directly. While it is theoretically possible to make the network device maintain two separate TCP connections with the two communicating end hosts, the cost of implementation is prohibitively large. Due to its handshake overhead and its longer startup times, TCP is also not a good protocol for high-performance RPC communication [davie]. More issues about TCP in data centers can be found in [homa].
UDP:
As another common transport protocol, UDP is unreliable and lacks mechanisms for flow control. Some previous INC applications assume the use of UDP as the transport layer for simplicity, but this provisional measure cannot meet production-level requirements or provide enough transport layer support for all the concerned INC applications. While these features could be implemented on top of UDP, this would shift complexity to applications and INC implementations.
QUIC:
In general, QUIC provides a better platform for efficient RPC communication than TCP [davie]. However, it is designed for wide area networks, and part of the packet header and the payload are encrypted, which prevents network devices from processing packets at the application layer and, potentially, from adding metadata.
MTP:
MTP [mtp] is the first transport protocol dedicated to INC. It grasps some core requirements for INC and is open to different congestion control algorithms. However, it is inspired by pathlet routing and mainly focuses on pathlet-based congestion control support. It lacks efficient support for all of the aforementioned application types.
RDMA:
RDMA allows two end hosts to exchange data quickly. Whether it runs natively (i.e., on InfiniBand) or is carried over UDP or TCP, it requires in-order and immutable transport, which poses challenges for INC applications similar to those of TCP.
HOMA:
HOMA [homa] is proposed as a transport protocol to replace TCP in data centers. However, HOMA is not designed with INC in mind either.
Information-Centric Networking (ICN):
ICN provides receiver-driven, data-oriented communication services and has features such as address-less operation due to the named-data access principle. It also provides intrinsic multi-destination delivery and has been demonstrated in remote method invocation and distributed computing scenarios [icndiscomp], albeit not yet in the particular INC scenarios presented here.
Ad Hoc Protocols:
Several INC applications (e.g., ATP and ASK) provide a customized transport layer. However, these protocols only work for a particular application. Moreover, there is a lack of a clear separation between the transport layer and the application layer. Some application layer functions leak into the transport layer, further limiting their generality.
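
To make the TCP issue above more concrete, the following simplified sketch models only the cumulative acknowledgment behavior of a byte-stream receiver. The function name cumulative_ack and the segment sizes are assumptions chosen for illustration; real TCP has many more mechanisms (e.g., selective acknowledgments and timers) that are not modeled here.

```python
def cumulative_ack(initial_seq, delivered_segments):
    """Return the next byte the receiver expects.

    delivered_segments is a list of (seq, length) pairs that actually
    reached the receiver. The acknowledgment advances only across
    contiguous bytes, as with a cumulative ACK.
    """
    expected = initial_seq
    for seq, length in sorted(delivered_segments):
        if seq > expected:
            break                  # gap: later bytes cannot be acknowledged
        expected = max(expected, seq + length)
    return expected


sent = [(0, 100), (100, 100), (200, 100)]

# End-to-end delivery: every segment arrives, so all 300 bytes are covered.
ack_e2e = cumulative_ack(0, sent)

# An INC device consumes the second segment and answers the client itself.
ack_inc = cumulative_ack(0, [s for s in sent if s[0] != 100])

print(ack_e2e, ack_inc)
```

In the second case the receiver keeps expecting byte 100, so the sender would treat the terminated segment as lost and retransmit it, even though the INC device has already completed the corresponding task.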

4. Requirements

The premise of the E2E principle is that it is more costly to guarantee the required level of reliability by relying on the network than by relying on the end hosts. INC introduces multiple end points in the communication, with one of them residing in the network, effectively changing the communication paradigm from E2E to E2I2E (I means intermediate nodes which conduct the transport layer functionalities). Therefore, we need to revisit the E2E principle to see if we can break it or adapt to it in the new context. We can observe several properties for the covered INC applications.

Based on these observations, a new transport layer protocol for INC in support of RPC-based applications can be designed. The protocol only works in a limited domain and virtualizes the network as a single logical middle point. That is, if multiple network devices collaborate on a computing task, they are considered as one device. Packet forwarding among these devices needs to be handled by the network layer using techniques such as Segment Routing (SR) and Service Function Chaining (SFC), depending on the overall system design.

From the previous discussion, we lay out the design requirements of a transport protocol dedicated for INC:

Simplicity:
Due to the limited resources and capabilities of the programmable network devices, the transport layer functions in them cannot be complex. For example, per-flow state machines and congestion control algorithms are difficult to implement in the programmable network devices. The protocol should aim to leave the complexity to the end hosts and require only simple processing in the programmable network devices.
Generality:
The different service models and communication models should be all supported. The protocol should also be independent of the underlying network layer protocol.
Openness:
Since the performance requirements of the applications may vary, the flow control and reliability mechanisms of the protocol should be open to different algorithms.
Compatibility:
The protocol should be able to coexist with the other transport protocols.

5. IANA Considerations

This document includes no request to IANA.

6. Security Considerations

TBD

7. References

7.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.

7.2. Informative References

[davie]
Davie, B., "QUIC is not a TCP Replacement", <https://systemsapproach.substack.com/p/quic-is-not-a-tcp-replacement>.
[homa]
Ousterhout, J., "It's Time to Replace TCP in the Datacenter", <http://dx.doi.org/10.48550/arXiv.2210.00714>.
[I-D.yao-tsvwg-cco-problem-statement-and-usecases]
Yao, K., Shiping, X., Li, Y., Huang, H., and D. Kutscher, "Collective Communication Optimization: Problem Statement and Use cases", Work in Progress, Internet-Draft, draft-yao-tsvwg-cco-problem-statement-and-usecases-00, <https://datatracker.ietf.org/doc/html/draft-yao-tsvwg-cco-problem-statement-and-usecases-00>.
[icndiscomp]
Geng, W., Zhang, Y., Kutscher, D., Kumar, A., Tarkoma, S., and P. Hui, "SoK: Distributed Computing in ICN", In Proceedings of the 10th ACM Conference on Information-Centric Networking (ACM ICN '23). Association for Computing Machinery, New York, NY, USA, 88-100, <https://doi.org/10.1145/3623565.3623712>.
[inc]
Klenk, B., et al., "An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives", ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), <https://dx.doi.org/10.1109/ISCA45697.2020.00085>.
[mtp]
Stephens, B., Grassi, D., Almasi, H., Ji, T., Vamanan, B., and A. Akella, "TCP is Harmful to In-Network Computing: Designing a Message Transport Protocol (MTP)", <http://dx.doi.org/10.1145/3484266.3487382>.
[netcache]
Jin, X., Li, X., Zhang, H., Soule, R., Lee, J., Foster, N., Kim, C., and I. Stoica, "NetCache: Balancing Key-Value Stores with Fast In-Network Caching", In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17). Association for Computing Machinery, New York, NY, USA, 121-136, <https://doi.org/10.1145/3132747.3132764>.
[netclone]
Kim, G., "NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs", In Proceedings of the ACM SIGCOMM 2023 Conference (ACM SIGCOMM '23). Association for Computing Machinery, New York, NY, USA, 195-207, <https://dl.acm.org/doi/10.1145/3603269.3604820>.

Authors' Addresses

Haoyu Song
Futurewei Technologies
Santa Clara, CA
United States of America
Wenfei Wu
Peking University
Beijing
China
Dirk Kutscher
The Hong Kong University of Science and Technology (Guangzhou)
Guangzhou
China