This document describes the distributed AI micromodel computing power scheduling service architecture.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 3 August 2025.¶
Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The Distributed AI Micromodel Computing Power Scheduling Service Architecture is a structured framework designed to address the challenges of scalability, flexibility, and efficiency in modern AI systems. By integrating model segmentation, micro-model deployment, and microservice orchestration, this architecture enables the effective allocation and management of computing resources across distributed environments. The primary focus lies in leveraging model segmentation to decompose large AI models into smaller, modular micro-models, which are executed collaboratively across distributed nodes.¶
The architecture is organized into four tightly integrated layers, each with distinct roles and responsibilities that together ensure seamless functionality:¶
Service Layer: This layer acts as the interface between the user-facing applications and the underlying system. It encapsulates AI capabilities as microservices, enabling modular deployment, elastic scaling, and independent version control. By routing user requests through service gateways, it ensures efficient interaction with back-end micro-models while balancing workloads. The service layer also facilitates collaboration between multiple micro-models, allowing them to function as part of a cohesive distributed system.¶
Control Layer: The control layer is the central coordination hub, responsible for task scheduling, resource allocation, and the implementation of model segmentation strategies. It decomposes large AI models into smaller, manageable components, assigns tasks to specific nodes, and ensures synchronized execution across distributed environments. This layer dynamically balances compute and network resources while adapting to system demands, ensuring high efficiency for training and inference workflows.¶
Computing Power Layer: As the execution core, this layer translates the decisions made by the control layer into distributed computation. It executes segmented micro-models on diverse hardware resources such as GPUs, CPUs, and accelerators, optimizing parallelism and fault tolerance. By coordinating with the control layer, it ensures that tasks are executed efficiently while leveraging distributed orchestration frameworks to handle diverse workloads.¶
Data Layer: The data layer underpins the entire system by managing secure storage, access, and transmission of data. It provides the necessary datasets, intermediate results, and metadata required for executing segmented micro-models. Privacy protection mechanisms, such as federated learning and differential privacy, ensure data security and compliance, while distributed database operations guarantee consistent access and high availability across nodes.¶
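The following non-normative Python sketch illustrates how the four layers might interact when serving a single inference request. All class and method names (ServiceLayer, ControlLayer, ComputingPowerLayer, DataLayer, plan, execute, fetch, handle) are hypothetical and are used for illustration only.¶
<CODE BEGINS>
   # Non-normative sketch: one inference request traversing the four layers.
   # All class and method names here are hypothetical illustrations.

   class DataLayer:
       def fetch(self, dataset_id: str) -> list[float]:
           # Stand-in for secure, distributed data access.
           return [1.0, 2.0, 3.0]

   class ComputingPowerLayer:
       def execute(self, micro_models: list[str], data: list[float]) -> list[float]:
           # One partial result per micro-model; real execution is distributed
           # over GPUs/CPUs/accelerators on different nodes.
           return [sum(data) for _ in micro_models]

   class ControlLayer:
       def plan(self, request: dict) -> list[str]:
           # Segmentation and scheduling decision: which micro-models serve
           # this request and where they run.
           return ["mm-0", "mm-1"]

   class ServiceLayer:
       def __init__(self, control, computing, data):
           self.control, self.computing, self.data = control, computing, data

       def handle(self, request: dict) -> dict:
           micro_models = self.control.plan(request)
           inputs = self.data.fetch(request["dataset_id"])
           partials = self.computing.execute(micro_models, inputs)
           return {"result": sum(partials)}  # aggregate and return to caller

   svc = ServiceLayer(ControlLayer(), ComputingPowerLayer(), DataLayer())
   print(svc.handle({"dataset_id": "demo"}))  # -> {'result': 12.0}
<CODE ENDS>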
At the heart of this architecture is model segmentation, which serves as the foundation for effectively distributing computation and optimizing resource utilization. The control layer breaks down models into smaller micro-models using strategies such as layer-based, business-specific, or block-based segmentation. These micro-models are then deployed as independent services in the service layer, where they are dynamically scaled and orchestrated to meet real-time demands. The computing power layer executes these tasks using parallel processing techniques and advanced scheduling algorithms, while the data layer ensures secure and efficient data flow to support both training and inference tasks.¶
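As a non-normative illustration, the sketch below shows a simple layer-based segmentation strategy that a control-layer component might apply: an ordered list of model layers is divided into contiguous slices, each of which becomes a micro-model assigned to a node. The names segment_by_layers and MicroModel are hypothetical and not defined by this document.¶
<CODE BEGINS>
   from dataclasses import dataclass

   # Hypothetical description of one micro-model produced by segmentation.
   @dataclass
   class MicroModel:
       model_id: str
       layer_names: list[str]   # contiguous slice of the original model
       target_node: str         # node chosen by the control layer

   def segment_by_layers(layer_names: list[str], nodes: list[str]) -> list[MicroModel]:
       """Layer-based segmentation: divide the ordered layer list into as
       many contiguous slices as there are available nodes."""
       per_slice = max(1, len(layer_names) // len(nodes))
       micro_models = []
       for i, node in enumerate(nodes):
           start = i * per_slice
           end = None if i == len(nodes) - 1 else (i + 1) * per_slice
           slice_layers = layer_names[start:end]
           if slice_layers:
               micro_models.append(MicroModel(f"mm-{i}", slice_layers, node))
       return micro_models

   # Example: an 8-layer model split across three nodes.
   for mm in segment_by_layers([f"layer{i}" for i in range(8)],
                               ["node-a", "node-b", "node-c"]):
       print(mm)
<CODE ENDS>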
By tightly integrating these layers, the architecture addresses critical challenges such as balancing compute and network resources, synchronizing distributed micro-models, and minimizing communication overhead. This cohesive design enables AI systems to achieve high performance, scalability, and flexibility across dynamic and resource-intensive workloads.¶
This document outlines the design principles, key components, and operational advantages of the Distributed AI Micromodel Computing Power Scheduling Service Architecture, emphasizing how model segmentation, micro-models, and microservices form the foundation for scalable and efficient distributed AI systems.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in .¶
TBD¶
As artificial intelligence technology evolves rapidly, the scale and complexity of AI models continue to grow, and traditional monolithic applications or centralized training and inference modes increasingly struggle to keep pace with fast-changing business needs. Encapsulating AI capabilities as microservices brings significant advantages in system flexibility, scalability, and service governance. By decoupling a model into a separate AI model service, the system avoids bottlenecks caused by deep coupling with the rest of the business logic, and the service can scale elastically when request or training load surges. AI models also iterate and upgrade quickly; a microservice architecture allows multiple model versions to coexist, supports grayscale (canary) releases, and enables fast rollback, thereby reducing the impact of model changes on the overall system.¶
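The sketch below is a non-normative illustration of how a grayscale release could be realized at the routing level: a weighted traffic split keeps most requests on the stable model version while a small share exercises the canary version, and rollback reduces the canary weight to zero. The version labels and weights are assumptions for illustration only.¶
<CODE BEGINS>
   import random

   # Hypothetical traffic split for a grayscale (canary) release: most
   # requests stay on the stable model version, a small share exercises
   # the new version under validation.
   VERSION_WEIGHTS = {
       "recommendation-model:v1.4": 0.95,  # stable version
       "recommendation-model:v1.5": 0.05,  # canary version
   }

   def pick_model_version(weights: dict[str, float]) -> str:
       """Choose a model version for one request according to the split."""
       versions = list(weights)
       return random.choices(versions,
                             weights=[weights[v] for v in versions], k=1)[0]

   # Rollback amounts to setting the canary weight to zero so that all
   # traffic returns to the stable version.
   print(pick_model_version(VERSION_WEIGHTS))
<CODE ENDS>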
AI microservice models often place extreme demands on computing power. On the one hand, training and inference usually involve massive data processing and high-density parallel computing, requiring the coordinated use of heterogeneous hardware such as GPUs, CPUs, FPGAs, and NPUs. On the other hand, when the model is large or the request volume is high, the computing power of a single machine is often insufficient; computation must be distributed and parallelized across multiple nodes, with resources released during idle periods to improve utilization. Such distributed training or inference typically relies on efficient communication strategies to synchronize model parameters or gradients; collective operations such as AllReduce or All-to-All are commonly used to reduce communication overhead and keep the model consistent across nodes.¶
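One common, non-normative realization of AllReduce-based gradient synchronization uses PyTorch's torch.distributed package with the NCCL backend; the sketch below assumes the script is launched by a tool such as torchrun that sets the RANK and WORLD_SIZE environment variables.¶
<CODE BEGINS>
   import torch
   import torch.distributed as dist

   def sync_gradients(model: torch.nn.Module) -> None:
       """Average gradients across all workers after the backward pass."""
       world_size = dist.get_world_size()
       for param in model.parameters():
           if param.grad is not None:
               # Sum the gradient tensors from every node, then divide so
               # that each replica applies the identical averaged update.
               dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
               param.grad /= world_size

   if __name__ == "__main__":
       # NCCL is the usual backend on GPU clusters; "gloo" works on
       # CPU-only nodes.  RANK and WORLD_SIZE are expected to be set by
       # the launcher (e.g. torchrun).
       dist.init_process_group(backend="nccl")
       # ... build the model, run forward/backward, then call
       # sync_gradients(model) before each optimizer step ...
<CODE ENDS>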
In a distributed environment, the network plays a crucial role. Large volumes of model parameters and gradients must be exchanged frequently during computation, which places high demands on network bandwidth and latency. In large-scale cluster scenarios, careful design of the network topology and the choice of communication framework cannot be ignored. Only in a high-bandwidth, low-latency network environment, combined with an appropriate communication library (such as NCCL or MPI), can a cluster fully exploit its computing potential and prevent communication from becoming the bottleneck of overall performance.¶
In the distributed AI micro-model computing power scheduling service architecture, the core of the business process is how to arrange the model across multiple nodes and make those nodes work collaboratively, ensuring efficient parameter synchronization and communication. Typically, a data scientist or algorithm engineer trains and evaluates a model with a deep learning framework during development, and the model and its dependencies are then packaged into a container image so that it can be deployed as an independent service. These encapsulated model services are registered with the system's microservice management platform for subsequent unified scheduling and access. Because AI models evolve rapidly, version management and grayscale releases are the norm: validating a new version on a small scale, or rolling it back quickly while keeping the old version online, minimizes risk and protects the user experience.¶
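As a non-normative sketch, the following shows a trained model wrapped as an independently deployable HTTP microservice using FastAPI and uvicorn; the endpoint paths, payload shape, and the placeholder run_model function are assumptions, and registration with the microservice management platform is omitted.¶
<CODE BEGINS>
   from fastapi import FastAPI
   import uvicorn

   app = FastAPI()

   def run_model(features: list[float]) -> float:
       # Placeholder for the packaged model; in practice it would be loaded
       # from the artifact baked into the container image.
       return sum(features)

   @app.post("/predict")
   def predict(payload: dict):
       # Payload shape {"features": [...]} is an assumption for illustration.
       return {"result": run_model(payload["features"])}

   @app.get("/healthz")
   def healthz():
       # Health endpoint polled by the microservice platform / orchestrator.
       return {"status": "ok"}

   if __name__ == "__main__":
       uvicorn.run(app, host="0.0.0.0", port=8080)
<CODE ENDS>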
When a model is deployed to a distributed cluster, computing power orchestration and resource scheduling allocate computing resources such as GPUs or CPUs according to real-time load, business priority, and hardware topology, and container orchestration tools (such as Kubernetes) start the corresponding service instances on each node. When distributed cooperation is needed, frameworks such as NCCL and Horovod handle inter-process communication. For inference scenarios, requests from upper-layer business systems or users usually arrive at an API gateway or service gateway first and are then distributed to the target service instances according to load balancing or other routing policies. If distributed inference is required, multiple nodes cooperate to run inference over the segmented model, aggregate the partial results, and return the final result to the requester. For training scenarios, when a distributed training task is triggered, the scheduler allocates a number of worker nodes to the training job; together they perform data loading, forward and backward propagation, and AllReduce or All-to-All communication to update model parameters synchronously. After training completes, the new model version is saved to the model repository or a corresponding storage medium, triggering the subsequent model release process.¶
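The following non-normative sketch illustrates one possible worker-allocation step in the training path: a greedy policy that packs the requested GPUs onto as few nodes as possible to limit cross-node communication. The Node structure and allocate_workers function are hypothetical.¶
<CODE BEGINS>
   from dataclasses import dataclass

   @dataclass
   class Node:
       name: str
       free_gpus: int

   def allocate_workers(nodes: list[Node], gpus_needed: int) -> dict[str, int]:
       """Greedy sketch: satisfy the request from the nodes with the most
       free GPUs first, so the job spans as few nodes as possible and
       cross-node communication is reduced."""
       assignment: dict[str, int] = {}
       remaining = gpus_needed
       for node in sorted(nodes, key=lambda n: n.free_gpus, reverse=True):
           if remaining == 0:
               break
           take = min(node.free_gpus, remaining)
           if take > 0:
               assignment[node.name] = take
               remaining -= take
       if remaining > 0:
           raise RuntimeError("insufficient free GPUs; job stays queued")
       return assignment

   # Example: an 8-GPU training job on a small heterogeneous cluster.
   cluster = [Node("node-a", 4), Node("node-b", 2), Node("node-c", 6)]
   print(allocate_workers(cluster, 8))   # -> {'node-c': 6, 'node-a': 2}
<CODE ENDS>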
In this process, real-time monitoring and elastic scaling mechanisms play an important role in ensuring system stability and optimizing resource utilization. At the monitoring level, a unified data collection and analysis platform tracks core indicators such as GPU utilization, network traffic, and request latency for each service node, raises timely alarms on failures, performance bottlenecks, or resource shortages, and performs automatic failover or takes faulty nodes offline. For elastic scaling, based on preset resource utilization thresholds or response time targets, the system dynamically increases the number of model service instances or nodes when the request volume surges or the training scale expands, and conversely reclaims idle resources, so that global computing and storage continue to cooperate efficiently.¶
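A minimal, non-normative sketch of a threshold-based elastic scaling decision follows; the utilization and latency thresholds are illustrative assumptions, not values defined by this document.¶
<CODE BEGINS>
   def scaling_decision(gpu_utilization: float, p95_latency_ms: float,
                        current_replicas: int,
                        min_replicas: int = 1, max_replicas: int = 16) -> int:
       """Return the desired replica count from two illustrative signals;
       the thresholds are assumptions, not values defined by this document."""
       if gpu_utilization > 0.80 or p95_latency_ms > 200.0:
           # Load is high: scale out, but never beyond the configured maximum.
           return min(current_replicas + 1, max_replicas)
       if gpu_utilization < 0.30 and p95_latency_ms < 50.0:
           # Resources are idle: reclaim one replica, keeping the minimum online.
           return max(current_replicas - 1, min_replicas)
       return current_replicas  # within the target band: no change

   print(scaling_decision(gpu_utilization=0.92, p95_latency_ms=310.0,
                          current_replicas=3))   # -> 4
<CODE ENDS>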
In addition, the distributed micro-model business flow needs to be combined with a data feedback (backflow) mechanism. The large volumes of logs, user feedback, and interaction data generated during inference can be returned to the data platform, provided that privacy and compliance requirements are met, and further used to train new models or optimize the performance of existing ones.¶
As AI models grow in scale and business demands intensify, single-node or single-cluster architectures often struggle to support high-intensity training and inference tasks. This leads to limitations in computational power or significant cost surges. Distributed training has emerged as a necessary approach to address these challenges by enabling the coordination of computing resources across multiple nodes and regions, thereby improving overall efficiency and fault tolerance. However, distributed deployment also introduces considerable complexity, such as handling heterogeneous hardware differences (e.g., GPU, CPU, FPGA) and balancing resource allocation across diverse network topologies and bandwidth conditions.¶
One of the key difficulties lies in managing resource scarcity effectively. Dynamic scheduling and allocation must account for factors such as business priority, model scale, and real-time workload conditions. Strategies such as priority-based queuing, elastic scaling, and cross-cluster resource collaboration are crucial to maximizing service efficiency under these constraints. However, implementing these strategies often depends on sophisticated partitioning and parallelism approaches.¶
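As a non-normative illustration of priority-based queuing, the sketch below keeps pending jobs in a heap ordered by priority and submission time, so the highest-priority (then oldest) job is dispatched when resources become available. The job identifiers and priorities are hypothetical.¶
<CODE BEGINS>
   import heapq
   import itertools

   # Hypothetical queue of pending jobs: a lower number means a higher
   # priority, and a monotonically increasing counter keeps FIFO order
   # among jobs with the same priority.
   _counter = itertools.count()
   _pending: list[tuple[int, int, str]] = []

   def submit(job_id: str, priority: int) -> None:
       heapq.heappush(_pending, (priority, next(_counter), job_id))

   def next_job() -> str:
       """Pop the highest-priority (then oldest) job when resources free up."""
       _, _, job_id = heapq.heappop(_pending)
       return job_id

   submit("nightly-batch-retrain", priority=5)
   submit("online-inference-burst", priority=1)
   print(next_job())   # -> "online-inference-burst"
<CODE ENDS>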
In distributed training, model partitioning and parallelism play a pivotal role. By employing techniques such as tensor slicing or pipeline parallelism, models can be decomposed and distributed across multiple nodes, with each node handling specific submodules or slices. This approach is particularly effective in training scenarios, where workload distribution ensures that no single server becomes a bottleneck. Similarly, in inference scenarios, input data can flow through a sequence of model microservices in a pipelined processing framework, which helps to maximize the utilization of scattered computing resources.¶
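The following non-normative sketch shows the data flow of pipelined inference, where each stage stands in for a model microservice hosting one slice of the full model; a real deployment would overlap the stages across nodes so that they process different micro-batches concurrently. The stage functions shown are placeholders.¶
<CODE BEGINS>
   from typing import Callable, Iterable, List

   # Each stage stands in for one model micro-service hosting a slice of
   # the full model on a different node.
   Stage = Callable[[float], float]

   def run_pipeline(stages: List[Stage], inputs: Iterable[float]) -> List[float]:
       """Push every input through the stages in order.  A real deployment
       would overlap the stages on different nodes so that they work on
       different micro-batches at the same time; this sketch only shows
       the data flow."""
       outputs = []
       for x in inputs:
           for stage in stages:
               x = stage(x)
           outputs.append(x)
       return outputs

   # Three placeholder model slices.
   stages = [lambda x: x * 2, lambda x: x + 1, lambda x: x ** 2]
   print(run_pipeline(stages, [1.0, 2.0, 3.0]))   # -> [9.0, 25.0, 49.0]
<CODE ENDS>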
Despite the potential benefits, these strategies are not without challenges. Distributed training inherently requires efficient synchronization and communication between nodes, which are further constrained by network resource availability. Moreover, achieving balance between computational and network resources demands meticulous planning and real-time adaptation. While model partitioning and parallel execution help alleviate pressure on individual servers and utilize idle nodes more effectively, they also add layers of complexity to resource coordination in distributed environments.¶
In distributed systems, large-scale data is often divided into multiple blocks that are stored and processed separately. While this improves data security and processing efficiency, it introduces significant challenges for data collaboration. When multiple nodes or microservice modules need to share or exchange data, strict coordination is required, including defining interfaces and call sequences in advance and managing consistency and concurrency control. The complexity increases further when cross-node dependencies exist between data blocks, making the scheduling, loading, and distribution of data one of the primary bottlenecks for system scalability and computational efficiency.¶
A key difficulty lies in synchronizing data across distributed nodes while minimizing latency and avoiding bottlenecks. Cross-node dependencies require precise scheduling to ensure data arrives at the correct location and time without conflicts. As the scale of data and the number of nodes grow, the management overhead for maintaining these dependencies can increase exponentially, particularly when network bandwidth or latency constraints exacerbate delays. Additionally, ensuring data consistency across multiple data blocks during concurrent access or updates adds another layer of complexity. High levels of concurrency can increase the risk of inconsistencies, data races, and synchronization issues, demanding advanced mechanisms to enforce data integrity.¶
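As a non-normative illustration, the sketch below derives one valid processing schedule for interdependent data blocks from a topological ordering of the dependency graph, using Python's standard graphlib module; the block names and dependencies are hypothetical.¶
<CODE BEGINS>
   from graphlib import TopologicalSorter

   # Hypothetical dependency graph between data blocks: a block can only be
   # loaded and processed after the blocks it depends on are available.
   block_deps = {
       "block-c": {"block-a", "block-b"},   # needs results from a and b
       "block-b": {"block-a"},
       "block-a": set(),
   }

   # A topological order yields one valid loading/processing schedule that
   # respects every cross-node dependency.
   schedule = list(TopologicalSorter(block_deps).static_order())
   print(schedule)   # -> ['block-a', 'block-b', 'block-c']
<CODE ENDS>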
Traditional distributed communication strategies, such as AllReduce and All-to-All, are widely used and remain effective in addressing certain data collaboration needs in training and inference tasks. For example, AllReduce is well-suited for data parallel scenarios, where all nodes compute on the same model with different data splits, and gradients or weights are synchronized via aggregation and broadcast. Similarly, All-to-All is valuable in more complex distributed tasks that require frequent intermediate data exchanges across nodes. However, these methods are not without limitations. As data and system complexity grow, they can lead to increased communication overhead, especially in scenarios where synchronization is uneven or poorly timed.¶
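The following non-normative sketch shows an All-to-All exchange with torch.distributed, as might be used to redistribute intermediate shards across ranks; it assumes a process group has already been initialized and that every rank holds exactly world_size shards of identical shape.¶
<CODE BEGINS>
   import torch
   import torch.distributed as dist

   def exchange_intermediate_shards(local_shards: list) -> list:
       """All-to-All sketch: every rank sends one shard to every other rank
       and receives one shard from each of them, as used when intermediate
       results must be redistributed across nodes.  Assumes the process
       group is already initialized and that each rank holds exactly
       world_size shards of identical shape."""
       received = [torch.empty_like(shard) for shard in local_shards]
       dist.all_to_all(received, local_shards)
       return received
<CODE ENDS>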
The effectiveness of traditional methods depends on careful tuning and precise execution. Poorly timed data exchanges can result in prolonged waiting times, underutilized resources, or even data mismatches. Although methods like AllReduce and All-to-All offer reliable frameworks for communication, their scalability and efficiency are often constrained by the challenges of cross-node synchronization, network variability, and system heterogeneity. These limitations highlight the need for continuous refinement and innovation in distributed communication and data collaboration strategies to overcome the challenges posed by block isolation.¶
4.1. Service layer¶
The service layer serves as the central hub of a distributed AI system, connecting the front-end, business logic, and microservices to enable efficient interaction and seamless workflows. It hosts the core service logic, processes user and business requests, and coordinates the collaboration of multiple components.¶
At the front-end layer, the service layer interacts with user-facing interfaces, which handle tasks such as user authentication, data input, and result presentation. These interfaces act as the entry point for system requests, routing them through APIs provided by a service gateway. The gateway manages external request routing, authentication, protocol translation, and load balancing to ensure smooth and efficient communication between the user interface and the back-end services.¶
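As a non-normative sketch, the following function illustrates the gateway's role described above: authenticate the caller, then forward the request to the least-loaded back-end model service instance. The Backend structure, the token check, and the route_request function are hypothetical simplifications.¶
<CODE BEGINS>
   from dataclasses import dataclass

   @dataclass
   class Backend:
       url: str
       in_flight: int   # requests currently being served by this instance

   def route_request(token: str, backends: list[Backend],
                     valid_tokens: set[str]) -> str:
       """Gateway sketch: authenticate the caller, then pick the
       least-loaded back-end model service instance (a simple form of
       load balancing)."""
       if token not in valid_tokens:
           raise PermissionError("authentication failed")
       target = min(backends, key=lambda b: b.in_flight)
       target.in_flight += 1
       return target.url   # a real gateway would now proxy the request

   backends = [Backend("http://mm-a:8080", 3), Backend("http://mm-b:8080", 1)]
   print(route_request("demo-token", backends, valid_tokens={"demo-token"}))
<CODE ENDS>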
TBD¶