IEEE/ACM International Symposium on Quality of Service (IWQoS) 2026
June 8, 2026

"Remote Direct Memory Access (RDMA) equips modern data centers (DCs) with high-performance data transmission. Yet, its potential is often unmet, as the State-Of-The-Art (SOTA) RDMA traffic load balancers struggle to balance high performance with practicability. In this paper, we first analyse existing RDMA-aware load balancers, examining their trade-offs across four key metrics: path utilization, packet order preservation, implementation overhead, and deployment complexity. We then present HP3, a host-based load balancer designed to excel in all these dimensions. HP3 combines Proactive Path Perception for real-time path state awareness with Reorder Free Rerouting to guarantee in-order packet delivery. It maintains low intrusiveness by complying with the standard RDMA transport layer and activating only during severe congestion. Implemented purely on end-hosts with no switch modifications, HP3 offers a practical solution with modest additional overhead. We prototype HP3 using DPDK and commodity RNICs, and evaluate it on a P4-programmable switch testbed. Experiments with diverse data center workloads show that HP3 outperforms SOTA RDMA load balancers by up to 3.4× for typical traffic and 1.3× for AI training traffic."
Good morning! It’s my great honor to stand here today and share our recent research. The topic of my presentation is HP3: Switch-Transparent Load Balancing for RDMA Data Centers — A Host-Only Approach. Before diving into our design and experiments, I’d like to start with the background and practical challenges of RDMA technology in modern data centers.
Nowadays, cloud computing, large-scale artificial intelligence training, distributed storage and real-time big data services have become the mainstream workloads of modern data centers. All of these applications place extremely demanding requirements on the network: ultra-low latency, high throughput and low CPU consumption. Against this backdrop, Remote Direct Memory Access, or RDMA, has become a foundational network technology for high-performance data centers. Unlike the traditional TCP/IP stack, RDMA uses kernel bypass and zero-copy data transmission: it moves data directly between the memory of two end hosts without involving the operating system kernel. This mechanism enables RDMA to achieve microsecond-level end-to-end latency, line-rate bandwidth utilization and negligible CPU overhead on servers. As a result, RDMA has been widely deployed in the infrastructure of major cloud vendors and AI computing clusters, supporting core workloads such as large language model training, distributed file systems and in-memory databases.
However, despite these performance advantages, RDMA still cannot realize its full potential in today’s large-scale data centers. Most modern data centers adopt the classic Clos topology, a multi-path network architecture designed to provide abundant redundant links and strong path diversity. The intent of this design is to use all network links and raise overall network capacity. But in actual RDMA deployments, this multi-path advantage is severely limited by traditional load balancing mechanisms. In other words, load balancing is the core bottleneck.
The most widely used load balancing scheme in current data centers is ECMP, Equal-Cost Multi-Path routing. ECMP applies a static hash to packet header fields, so all packets belonging to a single flow follow the same path. This naturally preserves packet order and requires no modification to existing switches, so it has excellent compatibility. Its fatal flaw is the lack of dynamic congestion awareness. When multiple heavy flows are hashed to the same link, they collide: some links become overloaded and stay congested, while many idle alternative paths are left unused. This not only wastes network bandwidth but also sharply increases queuing delay on the congested links. Worse still, persistent congestion triggers PFC, Priority Flow Control, and can even lead to PFC deadlocks, which directly destroy the low-latency property of RDMA. The problem is particularly prominent under skewed traffic, for example parameter synchronization during large-scale AI training and hotspot data access in distributed storage clusters. In these scenarios, traffic is extremely unbalanced and the performance of ECMP drops dramatically, greatly weakening the practical value of RDMA.
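To make the collision problem concrete, here is a minimal sketch of hash-based path selection. It assumes a toy hash (CRC32 over the flow 5-tuple) as a stand-in for a real switch's ECMP hash, which is vendor-specific. The function names `ecmp_path` and `path_loads` and the example addresses are hypothetical and only serve the illustration.

```python
import zlib
from collections import Counter


def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Map a flow 5-tuple to a path index (toy stand-in for a switch hash)."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return zlib.crc32(key) % num_paths


def path_loads(flows, num_paths):
    """Count how many flows land on each path under the static hash."""
    loads = Counter({path: 0 for path in range(num_paths)})
    for flow in flows:
        loads[ecmp_path(*flow, num_paths)] += 1
    return loads


# Five RoCEv2-style flows (UDP, destination port 4791) over four equal-cost paths.
flows = [("10.0.0.1", "10.0.1.1", 49152 + i, 4791, 17) for i in range(5)]
loads = path_loads(flows, num_paths=4)
print(dict(loads))
```

With five flows and four paths, at least one path must carry two or more flows, and the static hash has no way to notice or correct this. The same property explains why changing only the UDP source port can move a flow to a different path.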
To address these problems, academia and industry have proposed a variety of RDMA-oriented load balancing solutions in recent years. After surveying the existing state-of-the-art schemes, we find that they all face the same dilemma: performance and practicality cannot be achieved at the same time.
On the one hand, many high-performance RDMA load balancers, such as DCP, ConWeave, MORS and MP-RDMA, can effectively improve path utilization and relieve congestion, but they all have high deployment barriers. Some require programmable switches and complex in-network logic; some need to redesign the internal stack of the RNIC, the RDMA network interface card, which involves hardware revision and long-term compatibility work; others depend on real-time network telemetry systems, which greatly increases operation and maintenance complexity. All these factors make these solutions hard to deploy widely in production data centers.
On the other hand, some lightweight fine-grained load balancing schemes, such as RPS, AR and DRILL, split traffic at the per-packet level. They spread load well and balance link throughput, but they bring another critical problem: severe packet reordering. RDMA transport semantics strictly require in-order packet delivery. When out-of-order packets arrive, the RNIC misinterprets them as packet loss and initiates many unnecessary retransmissions. These spurious retransmissions consume useful bandwidth, form NACK storms, and eventually cause overall throughput to collapse. Flowlet-based schemes such as LetFlow and CONGA are not applicable to RDMA either: because the RNIC actively paces packet transmission, there is almost no idle gap between consecutive packets, and therefore no time window for flowlet splitting and path switching.
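The effect of reordering can be illustrated with a small simulation. This is a simplified sketch, assuming a go-back-N style receiver that accepts only the expected packet sequence number (PSN), discards anything that arrives early, and sends one NACK per gap. Real RNIC behavior has more detail; the function name `go_back_n_receiver` is hypothetical.

```python
def go_back_n_receiver(arrival_psns):
    """Simulate an in-order-only receiver.

    Returns (next_expected_psn, nacks_sent). Packets that arrive ahead of
    the expected PSN are discarded, so the sender must retransmit them.
    """
    expected = 0
    nacks = 0
    nack_outstanding = False  # send at most one NACK per sequence gap
    for psn in arrival_psns:
        if psn == expected:
            expected += 1
            nack_outstanding = False
        elif psn > expected and not nack_outstanding:
            nacks += 1
            nack_outstanding = True
        # psn < expected: duplicate of an already accepted packet, ignored
    return expected, nacks


print(go_back_n_receiver([0, 1, 2, 3]))  # (4, 0): in order, no NACK
print(go_back_n_receiver([0, 2, 1, 3]))  # (2, 2): no loss, yet two NACKs
```

In the second call nothing was lost in the network, but packets 2 and 3 were discarded and must be sent again. This is the spurious retransmission that per-packet spraying provokes.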
Faced with these contradictions and limitations, we pose the core research question of this work: Is it possible to design an RDMA load balancer that needs no modification to commercial switches, runs purely on end hosts with only moderate extra overhead, and simultaneously achieves high path utilization, strict packet order preservation and easy deployment? After extensive design and verification, our answer is yes, and that answer is the HP3 system we propose.
Next, I will elaborate on the overall architecture, core modules and working mechanism of HP3.
HP3 is a host-only, switch-transparent load balancing system. It runs completely on end servers and is fully compatible with standard RDMA protocols and existing commercial switches. Its core consists of two tightly coupled functional modules: Proactive Path Perception (PPP) and Reorder Free Rerouting (RFR). The two modules cooperate with each other to complete path monitoring, path probing and traffic migration step by step.
First of all, let’s introduce the Proactive Path Perception module, or PPP. The main responsibilities of PPP are real-time path status monitoring and dynamic alternative path discovery. It works in two phases: Incumbent Path Monitoring and Alternative Path Discovery.
In the Incumbent Path Monitoring phase, PPP continuously monitors the ECN (Explicit Congestion Notification) marks carried on ACK packets returned by the receiver. We set a congestion tolerance threshold. Only when ECN marks appear continuously for multiple monitoring cycles, which means that DCQCN, the standard RDMA congestion control mechanism, has failed to relieve persistent congestion, do we judge the current path to be severely congested. At that moment, PPP temporarily suspends data transmission to avoid making the congestion worse, and immediately starts the Alternative Path Discovery phase.
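The monitoring rule can be sketched as a counter of consecutive ECN-marked cycles. This is a minimal illustration under stated assumptions, not the HP3 implementation: the class name `IncumbentPathMonitor`, the parameter `tolerance_cycles` and its default of 3 are hypothetical, and a cycle counts as marked if it saw at least one ECN-marked ACK.

```python
class IncumbentPathMonitor:
    """Flag severe congestion after several consecutive ECN-marked cycles."""

    def __init__(self, tolerance_cycles=3):
        # Hypothetical threshold: how many marked cycles in a row we tolerate
        # before concluding that congestion control alone is not enough.
        self.tolerance_cycles = tolerance_cycles
        self.consecutive_marked = 0

    def on_cycle(self, ecn_marked_acks):
        """Feed the number of ECN-marked ACKs seen in one monitoring cycle.

        Returns True when the path should be treated as severely congested.
        """
        if ecn_marked_acks > 0:
            self.consecutive_marked += 1
        else:
            self.consecutive_marked = 0  # a clean cycle resets the streak
        return self.consecutive_marked >= self.tolerance_cycles


monitor = IncumbentPathMonitor(tolerance_cycles=3)
verdicts = [monitor.on_cycle(n) for n in [1, 1, 0, 1, 1, 1]]
print(verdicts)  # [False, False, False, False, False, True]
```

Requiring a streak rather than a single marked cycle keeps the mechanism inactive during transient congestion that the normal rate control can absorb.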
In the Alternative Path Discovery phase, we generate a batch of lightweight probe packets. By giving these probes different UDP source ports, we use the switches’ native ECMP hash function to spread them across different physical paths. The receiver feeds back path information through NACK packets. By analyzing the first returned NACK, we can quickly identify the path with the lowest latency and the lightest load. It is worth mentioning that HP3 also makes full use of the idle time during RDMA connection establishment. When the QP, Queue Pair, completes the connection handshake and enters the ready-to-send state, the upper-layer application usually performs memory registration. We turn this inherent latency into a probing window and complete path detection in advance, so that the actual data flow can select the best path from the very beginning and avoid blind routing.
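The selection step can be sketched as follows. This is an illustration only: it assumes each probe's feedback is recorded as an (arrival time, UDP source port) pair and that the earliest feedback identifies the best path. The helper names `candidate_source_ports` and `first_responder`, the base port and the timings are all hypothetical.

```python
def candidate_source_ports(base_port, count):
    """Generate distinct UDP source ports, one per probe packet."""
    return [base_port + i for i in range(count)]


def first_responder(feedback_events):
    """Return the source port of the probe whose feedback arrived first.

    feedback_events: iterable of (arrival_time_us, src_port) tuples.
    Returns None if no feedback was received.
    """
    events = list(feedback_events)
    if not events:
        return None
    _, port = min(events)  # earliest arrival time wins
    return port


ports = candidate_source_ports(49152, 4)
# Hypothetical feedback timings in microseconds for three of the four probes.
feedback = [(35.0, ports[0]), (12.5, ports[2]), (20.0, ports[1])]
print(first_responder(feedback))  # 49154
```

Subsequent data packets would then carry the winning source port, so that the switches' unchanged ECMP hash places the flow on the path the probe took.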
The second core module is Reorder Free Rerouting, RFR, which is the key to solving the packet reordering problem. Based on the path quality results fed back by PPP, RFR adopts two targeted strategies: Conservative Switching and Fast Recovery.
When PPP finds a high-quality, uncongested alternative path, RFR executes the Conservative Switching strategy. We do not switch paths immediately. Instead, we wait until all in-flight packets on the original path have been acknowledged by the receiver. Only after the original path is completely drained do we migrate all subsequent data traffic to the new path. This mechanism ensures in-order packet delivery and eliminates the risk of reordering. Meanwhile, we appropriately raise the sending rate to fully utilize the bandwidth of the new idle path.
If the probing result shows that all alternative paths are also congested, or the original path is still the best choice, RFR will trigger the Fast Recovery strategy. We resume data transmission on the original path immediately, roll back the sending rate to the state before suspension, and stop all path probing behaviors. This can effectively prevent frequent path switching, avoid network oscillation, and ensure the stable operation of the whole network.
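The two RFR strategies can be summarized as a small decision function. This is a sketch under stated assumptions rather than the actual HP3 logic: the names `RfrAction` and `rfr_decide` and the two inputs are hypothetical simplifications of the probing result and the sender's in-flight state.

```python
from enum import Enum


class RfrAction(Enum):
    FAST_RECOVERY = "resume on the original path at the pre-suspension rate"
    WAIT_FOR_DRAIN = "keep waiting for ACKs of in-flight packets"
    SWITCH_PATH = "move subsequent traffic to the new path"


def rfr_decide(better_path_found, inflight_packets):
    """Choose the next RFR action after probing completes.

    better_path_found: True if probing found an uncongested alternative.
    inflight_packets: unacknowledged packets still on the original path.
    """
    if not better_path_found:
        # All alternatives are congested, or the original path is still best.
        return RfrAction.FAST_RECOVERY
    if inflight_packets > 0:
        # Conservative Switching: never have packets on two paths at once.
        return RfrAction.WAIT_FOR_DRAIN
    return RfrAction.SWITCH_PATH


print(rfr_decide(better_path_found=True, inflight_packets=8).name)   # WAIT_FOR_DRAIN
print(rfr_decide(better_path_found=True, inflight_packets=0).name)   # SWITCH_PATH
print(rfr_decide(better_path_found=False, inflight_packets=8).name)  # FAST_RECOVERY
```

Because the switch happens only when nothing is in flight, packets from the old and new paths can never interleave at the receiver.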
After finishing the design, we implemented a complete HP3 prototype based on DPDK and commercial off-the-shelf RNICs. We built a two-tier leaf-spine Clos testbed with P4-programmable switches, and also conducted large-scale simulations in NS-3. We selected multiple typical workloads for the evaluation, including general data center workloads such as AliCloud Storage, Meta Hadoop, Data Mining and Web Search, as well as the classic AI collective communication workloads AllReduce and AllToAll. We compared HP3 with mainstream schemes including ECMP, LetFlow, RPS and AR, as well as the state-of-the-art ConWeave.
The experimental results confirm the advantages of HP3. In hardware micro-benchmarks, when a hash collision causes link congestion, HP3 quickly completes path migration. After HP3 is enabled, the ECN marking ratio drops to zero, RTT jitter is eliminated and RTT stabilizes at a low level, and the throughput of the competing flows is fully restored to the physical line rate. The Flow Completion Time (FCT) is reduced by nearly half.
In large-scale simulations of general data center workloads, HP3 achieves the lowest average and 99th-percentile FCT slowdown under different traffic loads. Compared with the best existing schemes, HP3 improves performance by up to 3.4 times. For AI training workloads, which are strongly bursty and synchronized, HP3 also adapts well. In the AllReduce and AllToAll scenarios that are critical for large model training, HP3 reduces Job Completion Time, with an improvement of up to 1.3 times over the baseline schemes. At the same time, HP3 effectively suppresses packet reordering, avoids a large amount of redundant retransmission traffic, and improves effective bandwidth utilization.
For future work, we have two main directions. First, we plan to offload the core logic of the PPP and RFR modules to a DPU (Data Processing Unit) to further reduce the CPU overhead on host servers. Second, we will design adaptive parameter tuning algorithms so that HP3 automatically adjusts its congestion thresholds and probing strategies as network conditions change, and can adapt to more complex and diverse large-scale cluster environments.
Finally, on behalf of our whole team, I would like to thank you all for listening. We welcome your comments and questions. Thank you very much!