Workgroup:
DSMC Working Group
Internet-Draft:
draft-yang-dmsc-distributed-model-00
Published:
January 2025
Intended Status:
Standards Track
Expires:
3 August 2025
Authors:
H. Yang
Beijing University of Posts and Telecommunications
TK. Yu
Beijing University of Posts and Telecommunications

Distributed AI model architecture for microservices communication and computing power scheduling

Abstract

This document describes the distributed AI micromodel computing power scheduling service architecture.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 3 August 2025.

Table of Contents

   1.  Introduction
   2.  Conventions used in this document
   3.  Terminology
   4.  Scenarios and requirements
     4.1.  AI Microservice model scenario requirements
     4.2.  Distributed Micro model Service Flow
   5.  Key issues and challenges
     5.1.  Balancing Compute and Network Resources under Constraints
     5.2.  Data Collaboration Challenges under Block Isolation
   6.  Distributed solution based on model segmentation
     6.1.  Service layer
   7.  IANA Considerations
   8.  Acknowledgement
   Authors' Addresses

1. Introduction

The Distributed AI Micromodel Computing Power Scheduling Service Architecture is a structured framework designed to address the challenges of scalability, flexibility, and efficiency in modern AI systems. By integrating model segmentation, micro-model deployment, and microservice orchestration, this architecture enables the effective allocation and management of computing resources across distributed environments. The primary focus lies in leveraging model segmentation to decompose large AI models into smaller, modular micro-models, which are executed collaboratively across distributed nodes.

The architecture is organized into four tightly integrated layers, each with distinct roles and responsibilities that together ensure seamless functionality:

Service Layer: This layer acts as the interface between the user-facing applications and the underlying system. It encapsulates AI capabilities as microservices, enabling modular deployment, elastic scaling, and independent version control. By routing user requests through service gateways, it ensures efficient interaction with back-end micro-models while balancing workloads. The service layer also facilitates collaboration between multiple micro-models, allowing them to function as part of a cohesive distributed system.

Control Layer: The control layer is the central coordination hub, responsible for task scheduling, resource allocation, and the implementation of model segmentation strategies. It decomposes large AI models into smaller, manageable components, assigns tasks to specific nodes, and ensures synchronized execution across distributed environments. This layer dynamically balances compute and network resources while adapting to system demands, ensuring high efficiency for training and inference workflows.

Computing Power Layer: As the execution core, this layer translates the decisions made by the control layer into distributed computation. It executes segmented micro-models on diverse hardware resources such as GPUs, CPUs, and accelerators, optimizing parallelism and fault tolerance. By coordinating with the control layer, it ensures that tasks are executed efficiently while leveraging distributed orchestration frameworks to handle diverse workloads.

Data Layer: The data layer underpins the entire system by managing secure storage, access, and transmission of data. It provides the necessary datasets, intermediate results, and metadata required for executing segmented micro-models. Privacy protection mechanisms, such as federated learning and differential privacy, ensure data security and compliance, while distributed database operations guarantee consistent access and high availability across nodes.

At the heart of this architecture is model segmentation, which serves as the foundation for effectively distributing computation and optimizing resource utilization. The control layer breaks down models into smaller micro-models using strategies such as layer-based, business-specific, or block-based segmentation. These micro-models are then deployed as independent services in the service layer, where they are dynamically scaled and orchestrated to meet real-time demands. The computing power layer executes these tasks using parallel processing techniques and advanced scheduling algorithms, while the data layer ensures secure and efficient data flow to support both training and inference tasks.
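As a concrete illustration of layer-based segmentation, the following Python sketch splits a small PyTorch-style sequential model into micro-models at chosen layer boundaries. The helper segment_model, the layer sizes, and the cut points are illustrative assumptions rather than part of this architecture; in a real deployment each resulting micro-model would be packaged and served on a different node.

   # Illustrative sketch: layer-based segmentation of a toy PyTorch model into
   # micro-models that could be deployed as separate services.
   import torch
   import torch.nn as nn

   def segment_model(model: nn.Sequential, boundaries):
       """Split a sequential model into micro-models at the given layer indices."""
       layers = list(model.children())
       segments, start = [], 0
       for end in list(boundaries) + [len(layers)]:
           segments.append(nn.Sequential(*layers[start:end]))
           start = end
       return segments

   # A toy 4-layer model cut into two micro-models after layer 2.
   full_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                              nn.Linear(64, 16), nn.ReLU())
   micro_models = segment_model(full_model, boundaries=[2])

   x = torch.randn(8, 32)
   for m in micro_models:        # each micro-model could run on a different node
       x = m(x)
   print(x.shape)                # torch.Size([8, 16])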

By tightly integrating these layers, the architecture addresses critical challenges such as balancing compute and network resources, synchronizing distributed micro-models, and minimizing communication overhead. This cohesive design enables AI systems to achieve high performance, scalability, and flexibility across dynamic and resource-intensive workloads.

This document outlines the design principles, key components, and operational advantages of the Distributed AI Micromodel Computing Power Scheduling Service Architecture, emphasizing how model segmentation, micro-models, and microservices form the foundation for scalable and efficient distributed AI systems.

2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

3. Terminology

TBD

4. Scenarios and requirements

This section analyzes representative scenarios for AI microservice models and the distributed micro-model service flow, and derives the requirements they place on computing power, communication, and scheduling.

4.1. AI Microservice model scenario requirements

At present, with the accelerated evolution of artificial intelligence technology, the scale and complexity of AI models continue to expand, and traditional monolithic or centralized inference and training modes find it increasingly difficult to meet rapidly changing business needs. Encapsulating AI capabilities as microservices brings significant advantages in system flexibility, scalability, and service governance. By decoupling models through microservices, an AI model service avoids the bottlenecks caused by deep coupling with the rest of the business logic and can scale elastically when request or training load surges. AI models also iterate and upgrade quickly; a microservice architecture allows multiple model versions to coexist and supports grayscale release and fast rollback, reducing the impact of upgrades on the overall system.
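The following Python sketch shows one simple way a service gateway could implement grayscale release by weighted version routing. The service names, traffic weights, and the route() helper are hypothetical illustrations, not an interface defined by this document.

   # Minimal sketch of weighted version routing for a grayscale release.
   import random

   VERSION_WEIGHTS = {"model-v1": 0.95, "model-v2": 0.05}   # 5% of traffic to the new version

   def route(request):
       versions, weights = zip(*VERSION_WEIGHTS.items())
       target = random.choices(versions, weights=weights, k=1)[0]
       return target    # the gateway would forward `request` to this service instance

   print(route({"input": "..."}))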

AI microservice models are often extremely demanding on computing power. On the one hand, training and inference usually involve massive data processing and high-density parallel computing, which requires the collaborative work of heterogeneous hardware resources such as GPUs, CPUs, FPGAs, and NPUs. On the other hand, when the model is large or the request volume is high, the computing power of a single machine is often insufficient; the workload must be distributed across multiple nodes for parallel computing, and resources should be released during idle periods to improve utilization. Such distributed training or inference usually relies on efficient communication strategies to synchronize model parameters or gradients; collectives such as AllReduce or All-to-All are often used to reduce communication overhead and ensure model consistency.
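As a sketch of such synchronization, the snippet below averages gradients across workers with an AllReduce collective using torch.distributed. It assumes the process group has already been initialized (for example with the NCCL backend) and is illustrative only.

   # Data-parallel gradient synchronization with AllReduce (sketch).
   import torch.distributed as dist

   def allreduce_gradients(model):
       """Average gradients across all workers after the backward pass."""
       world_size = dist.get_world_size()
       for param in model.parameters():
           if param.grad is not None:
               dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum across workers
               param.grad /= world_size                           # then average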

In a distributed environment, the network plays a crucial role. Large volumes of model parameters and gradients are exchanged frequently during computation, which places high demands on network bandwidth and latency. In large-scale cluster scenarios, the design of the network topology and the choice of communication framework cannot be ignored. Only in a high-bandwidth, low-latency network environment, combined with an appropriate communication library (such as NCCL or MPI), can a cluster fully exploit its computing potential and prevent communication from becoming the bottleneck of global performance.

4.2. Distributed Micro model Service Flow

In the distributed AI micro-model computing power scheduling service architecture, the core of the business process is how to arrange the model across multiple nodes and make those nodes work collaboratively, so that parameter synchronization and communication remain efficient. Typically, a model is trained and evaluated by a data scientist or algorithm engineer using a deep learning framework during development, and is then containerized so that the model and its dependencies are packaged into an image that can be deployed as an independent service. These encapsulated model services are registered with the system's microservice management platform for subsequent unified scheduling and access. As AI models evolve rapidly, version management and grayscale releases are the norm: validating a new version on a small scale, or rolling it back quickly while keeping the old version online, minimizes risk and preserves the user experience.
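The sketch below illustrates what registering a containerized model service with such a management platform might look like. The registry URL, endpoint, and payload fields are assumptions made for illustration; they are not an API defined by this document.

   # Hypothetical registration of a containerized model service (sketch).
   import json
   import urllib.request

   def register_model_service(registry_url, name, version, image, endpoint):
       payload = {"name": name, "version": version,
                  "image": image, "endpoint": endpoint}
       req = urllib.request.Request(registry_url,
                                    data=json.dumps(payload).encode("utf-8"),
                                    headers={"Content-Type": "application/json"},
                                    method="POST")
       with urllib.request.urlopen(req) as resp:
           return resp.status

   # register_model_service("http://registry.example/api/services",
   #                        "resnet50-infer", "1.4.0",
   #                        "registry.example/resnet50:1.4.0",
   #                        "http://node-3:8080/predict")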

When the model is deployed to a distributed cluster, computing power orchestration and resource scheduling allocate resources such as GPUs or CPUs according to real-time load, business priority, and hardware topology, and container orchestration tools (such as Kubernetes) start the corresponding service instances on each node. When distributed cooperation is needed, frameworks such as NCCL or Horovod handle inter-process communication. For inference scenarios, requests from upper-layer business systems or users usually arrive at the API gateway or service gateway first and are then distributed to the target service instances according to load balancing or other routing policies. If distributed inference is needed, multiple nodes cooperate to run the segmented model, the partial results are aggregated, and the final inference result is returned to the requester. For training scenarios, when a distributed training job is triggered, the scheduler allocates several worker nodes that together perform data loading, forward and backward propagation, and AllReduce or All-to-All communication to synchronize model parameter updates. After training completes, the new model version is saved to the model repository or a corresponding storage medium, triggering the subsequent model release process.
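The following sketch outlines the pipelined inference path described above: the gateway passes a request through each micro-model segment in turn and returns the final result. The node addresses and the call_segment() placeholder are hypothetical; a real deployment would invoke the remote services over gRPC or HTTP.

   # Pipelined inference across segmented micro-models (sketch).
   SEGMENT_NODES = ["http://node-1:8080", "http://node-2:8080", "http://node-3:8080"]

   def call_segment(node_url, payload):
       # Placeholder: a real gateway would serialize `payload` and call the remote
       # micro-model service here; this stub only tags the payload with its route.
       return {"routed_via": node_url, "data": payload}

   def pipelined_inference(request):
       """Pass the request through each micro-model segment in order."""
       result = request
       for node in SEGMENT_NODES:
           result = call_segment(node, result)
       return result

   print(pipelined_inference({"input": "..."}))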

In this process, real-time monitoring and elastic scaling play an important role in ensuring system stability and optimizing resource utilization. On the monitoring side, a unified data acquisition and analysis platform tracks core indicators such as GPU utilization, network traffic, and request latency for each service node, raises timely alarms in the case of failures, performance bottlenecks, or insufficient resources, and performs automatic failover or takes nodes offline. On the elastic scaling side, based on preset resource utilization thresholds or response time targets, the system dynamically increases the number of model service instances or nodes when the request volume surges or the training scale expands, and conversely reclaims idle resources, keeping global computing and storage working together efficiently.
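A minimal sketch of such a threshold-based scaling decision is shown below. The metric names, thresholds, and replica limits are illustrative assumptions.

   # Threshold-based elastic scaling decision (sketch).
   def scaling_decision(gpu_util, p95_latency_ms, replicas,
                        util_high=0.85, util_low=0.30, latency_target_ms=200,
                        min_replicas=1, max_replicas=16):
       """Return the desired replica count for the current utilization and latency."""
       if (gpu_util > util_high or p95_latency_ms > latency_target_ms) and replicas < max_replicas:
           return replicas + 1      # scale out under pressure
       if gpu_util < util_low and replicas > min_replicas:
           return replicas - 1      # reclaim idle resources
       return replicas

   print(scaling_decision(gpu_util=0.92, p95_latency_ms=250, replicas=4))   # -> 5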

In addition, the distributed micro-model business flow needs to be combined with a data backflow mechanism. The large volume of logs, user feedback, and interaction information generated during inference can, provided privacy and compliance requirements are met, be returned to the data platform and further used to train new models or optimize the performance of existing ones.

5. Key issues and challenges

5.1. Balancing Compute and Network Resources under Constraints

As AI models grow in scale and business demands intensify, single-node or single-cluster architectures often struggle to support high-intensity training and inference tasks. This leads to limitations in computational power or significant cost surges. Distributed training has emerged as a necessary approach to address these challenges by enabling the coordination of computing resources across multiple nodes and regions, thereby improving overall efficiency and fault tolerance. However, distributed deployment also introduces considerable complexity, such as handling heterogeneous hardware differences (e.g., GPU, CPU, FPGA) and balancing resource allocation across diverse network topologies and bandwidth conditions.

One of the key difficulties lies in managing resource scarcity effectively. Dynamic scheduling and allocation must account for factors such as business priority, model scale, and real-time workload conditions. Strategies such as priority-based queuing, elastic scaling, and cross-cluster resource collaboration are crucial to maximizing service efficiency under these constraints. However, implementing these strategies often depends on sophisticated partitioning and parallelism approaches.
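As one possible illustration of priority-based queuing under scarce accelerators, the sketch below admits tasks strictly in priority order as long as the highest-priority waiting task fits the free GPU budget. The Task fields and the admission policy are assumptions for illustration only.

   # Strict priority-based admission of tasks onto a shared GPU budget (sketch).
   import heapq
   from dataclasses import dataclass, field

   @dataclass(order=True)
   class Task:
       priority: int                          # lower value = higher priority
       name: str = field(compare=False)
       gpus_needed: int = field(compare=False)

   def schedule(tasks, free_gpus):
       """Admit tasks in priority order while the next task fits the free GPUs."""
       heap = list(tasks)
       heapq.heapify(heap)
       admitted = []
       while heap and heap[0].gpus_needed <= free_gpus:
           task = heapq.heappop(heap)
           free_gpus -= task.gpus_needed
           admitted.append(task.name)
       return admitted

   print(schedule([Task(2, "batch-train", 8), Task(0, "online-infer", 2)],
                  free_gpus=8))               # -> ['online-infer']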

In distributed training, model partitioning and parallelism play a pivotal role. By employing techniques like tensor slicing or computing power pipelining, models can be decomposed and distributed across multiple nodes, with each node handling specific submodules or slices. This approach is particularly effective in training scenarios, where workload distribution ensures that no single server becomes a bottleneck. Similarly, in inference scenarios, input data can flow through a sequence of model microservices in a pipelined processing framework, which helps to maximize the utilization of scattered computing resources.
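The snippet below gives a minimal numerical illustration of tensor slicing for a single linear layer: the weight matrix is split column-wise so that each "node" computes a partial output, and the concatenated partial outputs equal the full result. The sizes and the two-way split are illustrative; a real system would place each slice on a different device.

   # Column-wise tensor slicing of one linear layer across two "nodes" (sketch).
   import torch

   torch.manual_seed(0)
   x = torch.randn(4, 32)                  # a batch of activations
   W = torch.randn(32, 64)                 # full weight matrix of one layer

   W_slices = torch.chunk(W, chunks=2, dim=1)        # split weight columns
   partial_outputs = [x @ w for w in W_slices]       # each node computes its slice
   y_parallel = torch.cat(partial_outputs, dim=1)    # gather the partial results

   assert torch.allclose(y_parallel, x @ W, atol=1e-5)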

Despite the potential benefits, these strategies are not without challenges. Distributed training inherently requires efficient synchronization and communication between nodes, which are further constrained by network resource availability. Moreover, achieving balance between computational and network resources demands meticulous planning and real-time adaptation. While model partitioning and parallel execution help alleviate pressure on individual servers and utilize idle nodes more effectively, they also add layers of complexity to resource coordination in distributed environments.

5.2. Data Collaboration Challenges under Block Isolation

In distributed systems, large-scale data is often divided into multiple blocks that are stored and processed separately. While this improves data security and processing efficiency, it introduces significant challenges for data collaboration. When multiple nodes or microservice modules need to share or exchange data, strict coordination is required, including defining interfaces and call sequences in advance and managing consistency and concurrency control. The complexity increases further when cross-node dependencies exist between data blocks, making the scheduling, loading, and distribution of data one of the primary bottlenecks for system scalability and computational efficiency.

A key difficulty lies in synchronizing data across distributed nodes while minimizing latency and avoiding bottlenecks. Cross-node dependencies require precise scheduling to ensure data arrives at the correct location and time without conflicts. As the scale of data and the number of nodes grow, the management overhead for maintaining these dependencies can increase exponentially, particularly when network bandwidth or latency constraints exacerbate delays. Additionally, ensuring data consistency across multiple data blocks during concurrent access or updates adds another layer of complexity. High levels of concurrency can increase the risk of inconsistencies, data races, and synchronization issues, demanding advanced mechanisms to enforce data integrity.

Traditional distributed communication strategies, such as AllReduce and All-to-All, are widely used and remain effective in addressing certain data collaboration needs in training and inference tasks. For example, AllReduce is well-suited for data parallel scenarios, where all nodes compute on the same model with different data splits, and gradients or weights are synchronized via aggregation and broadcast. Similarly, All-to-All is valuable in more complex distributed tasks that require frequent intermediate data exchanges across nodes. However, these methods are not without limitations. As data and system complexity grow, they can lead to increased communication overhead, especially in scenarios where synchronization is uneven or poorly timed.
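For reference, the sketch below wraps an All-to-All exchange with torch.distributed: each rank scatters one equally sized chunk to every other rank and gathers one chunk from each. It assumes the process group has already been initialized and that chunk shapes match across ranks.

   # All-to-All exchange of equally sized chunks between ranks (sketch).
   import torch
   import torch.distributed as dist

   def exchange_chunks(local_chunks):
       """local_chunks: one tensor per destination rank; returns the received chunks."""
       world_size = dist.get_world_size()
       assert len(local_chunks) == world_size
       received = [torch.empty_like(c) for c in local_chunks]   # assumes matching shapes
       dist.all_to_all(received, local_chunks)
       return received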

The effectiveness of traditional methods depends on careful tuning and precise execution. Poorly timed data exchanges can result in prolonged waiting times, underutilized resources, or even data mismatches. Although methods like AllReduce and All-to-All offer reliable frameworks for communication, their scalability and efficiency are often constrained by the challenges of cross-node synchronization, network variability, and system heterogeneity. These limitations highlight the need for continuous refinement and innovation in distributed communication and data collaboration strategies to overcome the challenges posed by block isolation.

6. Distributed solution based on model segmentation

6.1. Service layer

The service layer serves as the central hub of a distributed AI system, connecting the front-end, business logic, and microservices to enable efficient interaction and seamless workflows. It hosts the core service logic, processes user and business requests, and coordinates the collaboration of multiple components.

At the front-end layer, the service layer interacts with user-facing interfaces, which handle tasks such as user authentication, data input, and result presentation. These interfaces act as the entry point for system requests, routing them through APIs provided by a service gateway. The gateway manages external request routing, authentication, protocol translation, and load balancing to ensure smooth and efficient communication between the user interface and the back-end services.
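The sketch below condenses the gateway behaviour described above into a few lines: authenticate the request, then route it to a back-end micro-model instance in round-robin fashion. The back-end addresses, the token check, and the response format are placeholders, not an interface defined by this document.

   # Minimal service-gateway sketch: authentication plus round-robin routing.
   import itertools

   BACKENDS = ["http://model-a:8080", "http://model-b:8080"]
   _next_backend = itertools.cycle(BACKENDS)

   def handle_request(request):
       if request.get("token") != "expected-token":   # placeholder authentication
           return {"status": 401}
       backend = next(_next_backend)                  # round-robin load balancing
       # A real gateway would forward the request (and translate protocols) here.
       return {"status": 200, "routed_to": backend}

   print(handle_request({"token": "expected-token", "payload": "..."}))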

7. IANA Considerations

TBD

8. Acknowledgement

TBD

Authors' Addresses

Hui Yang
Beijing University of Posts and Telecommunications
Beijing
China

Tiankuo Yu
Beijing University of Posts and Telecommunications
Beijing
China