Switch-Transparent Load Balancing for RDMA Data Centers: A Host-Only Approach

IEEE/ACM International Symposium on Quality of Service (IWQoS) 2026

June 8, 2026

"Remote Direct Memory Access (RDMA) equips modern data centers (DCs) with high-performance data transmission. Yet, its potential is often unmet, as the State-Of-The-Art (SOTA) RDMA traffic load balancers struggle to balance high-performance with practicability. In this paper, we first analyse existing RDMA-aware load balancers, examining their trade-offs across four key metrics: path utilization, packet order preservation, implementation overhead, and deployment complexity. We then present HP3, a host-based load balancer designed to excel in all these dimensions. HP3 combines Proactive Path Perception for real-time path state awareness with Reorder Free Rerouting to guarantee in-order packet delivery. It maintains low intrusiveness by complying with the standard RDMA transport layer and activating only during severe congestion. Implemented purely on end-hosts with no switch modifications, HP3 offers a practical solution with modest additional overhead. We prototype HP3 using DPDK and commodity RNICs, and evaluate it on a P4-programmable switch testbed. Experiments with diverse data center workloads show that HP3 outperforms SOTA RDMA load balancers by up to 3.4× for typical traffic and 1.3× for AI training traffic."

PRESENTATION

Good morning! It’s my great honor to stand here today and share our recent research. The topic of my presentation is HP3: Switch-Transparent Load Balancing for RDMA Data Centers — A Host-Only Approach. Before diving into our design and experiments, I’d like to start with the background and practical challenges of RDMA technology in modern data centers.

Nowadays, thanks to the low latency, high bandwidth and low CPU overhead enabled by kernel bypass and zero-copy transmission, RDMA has been widely deployed in data centers that adopt the mainstream Clos architecture, where it supports high-performance services such as cloud computing and large model training. Nevertheless, traditional load balancing mechanisms fail to exploit the multi-path advantages of the Clos topology, so load balancing has become the core bottleneck that prevents RDMA from reaching its full performance potential in large-scale data centers.

We conducted a series of simulation experiments to reveal the defects of existing RDMA load balancing schemes, as illustrated in the following figures. Figure 2 compares the link throughput imbalance of different methods under various workloads, and Figure 3 presents their normalized goodput. We can see that ECMP and LetFlow adopt static routing or flowlet-based scheduling, resulting in extremely high throughput imbalance. The link load is highly uneven, most multipath resources are left idle, and the overall bandwidth utilization is poor. In contrast, fine-grained per-packet scheduling schemes like RPS and AR can distribute traffic evenly across links and achieve perfect load balancing. However, frequent path switching leads to severe packet reordering. RDMA regards out-of-order packets as packet loss and triggers a large number of spurious retransmissions, which drastically reduces the actual goodput. In conclusion, traditional RDMA load balancing schemes have obvious limitations and fail to achieve a good balance among load distribution, in-order packet delivery and overall network performance.

Faced with these contradictions and limitations, we put forward the core research question of this work: Is it possible to design an RDMA load balancer that needs no modification to commercial switches, runs purely on end hosts with only moderate extra overhead, and simultaneously achieves high path utilization, strict packet order preservation and easy deployment? After repeated research, design and verification, our answer is yes, and that answer is the HP3 system we propose.

Next, I will elaborate on the overall architecture, core modules and working mechanism of HP3.

HP3 is a host-only load balancing system transparent to switches. It runs entirely on application servers and is fully compatible with standard RDMA protocols and existing commercial switches. The system consists of two closely coordinated core modules: Proactive Path Perception (PPP) and Reorder Free Rerouting (RFR). The PPP module monitors link congestion in real time and finds relatively better paths through detection, while the RFR module switches traffic and ensures in-order packet delivery. The two modules work together to complete path monitoring, alternative path detection and seamless traffic migration.

First of all, let’s introduce the Proactive Path Perception module, or PPP. The main responsibilities of PPP are real-time path status monitoring and dynamic alternative path discovery. It works in two phases: Incumbent Path Monitoring and Alternative Path Discovery.

In the path monitoring stage, PPP continuously analyzes feedback packets from the receiver, paying close attention to CNPs (Congestion Notification Packets). We set a threshold on the number of consecutive congested cycles. When CNPs appear across that many successive monitoring cycles, it means that DCQCN, the standard RDMA congestion control algorithm, has failed to relieve persistent congestion and the current path is heavily congested. Here is how it works. We conduct monitoring periodically. On the diagram, the green and orange blocks are incoming feedback packets: green for ACKs and orange for CNPs. The numbers below mark each monitoring cycle. For example, a CNP received in Cycle 1 sets the consecutive congestion cycle count to 1. If there is no CNP in Cycle 2, the count resets to zero. When the count reaches 3, HP3 starts to probe alternative paths.
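The counting rule just described can be sketched in a few lines of Python. This is only an illustration of the logic in the example, not the DPDK prototype: the class name `CongestionMonitor`, the method `end_cycle` and the per-cycle boolean input are assumptions made for the sketch, and the threshold of 3 is taken from the example above.

```python
from dataclasses import dataclass


@dataclass
class CongestionMonitor:
    """Counts consecutive monitoring cycles in which at least one CNP arrived.

    Hypothetical helper for illustration; names are not from HP3 itself.
    """
    threshold: int = 3      # consecutive congested cycles before probing starts
    consecutive: int = 0    # current run of congested cycles

    def end_cycle(self, cnp_seen: bool) -> bool:
        """Close one monitoring cycle; return True when probing should start."""
        # A cycle with a CNP extends the run; a cycle without one resets it.
        self.consecutive = self.consecutive + 1 if cnp_seen else 0
        return self.consecutive >= self.threshold


monitor = CongestionMonitor()
# Cycle 1 has a CNP, Cycle 2 has none (reset), Cycles 3 to 5 each have a CNP.
decisions = [monitor.end_cycle(seen) for seen in [True, False, True, True, True]]
print(decisions)
```

Only the last cycle returns True, because the CNP-free Cycle 2 resets the count and three fresh congested cycles are needed afterwards.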

During the alternative path discovery phase, the system generates a set of lightweight probe packets. By assigning different UDP source ports to these probes, it leverages the native ECMP hashing mechanism of switches to distribute the packets across diverse physical paths. As illustrated in the figure, Host A is the sender and Host B acts as the receiver. The PPP module on Host A creates probe packets with distinct source port numbers. In this example, three probes numbered 1, 2 and 3 are generated and assigned unique ports, so that they can be hashed onto different paths. The Packet Sequence Number (PSN) of every probe packet is deliberately set to a value higher than the PSN currently expected by the receiver. Take probe packet 2 as an example: in compliance with standard RDMA protocols, the receiver automatically returns a Negative Acknowledgment (NACK) once it detects the jump in sequence number. This is the core principle that guarantees probes reliably trigger NACK responses. Meanwhile, each NACK carries the source port of the probe that triggered it. By parsing the first NACK to arrive, we can quickly identify the path with the lowest latency and lightest load. Furthermore, HP3 makes full use of the idle period during RDMA connection establishment. After the RDMA Queue Pair (QP) completes handshaking and enters the ready-to-send state, upper-layer applications usually perform memory registration. We convert this inherent waiting delay into a probing window so that path probing finishes in advance. In this way, actual data flows can select the preferable path from the very beginning, avoiding blind route selection at the initial stage of transmission.
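The probe-and-select step can be modelled without any networking. The following is a minimal sketch under stated assumptions: `Probe`, `make_probes` and `pick_path` are hypothetical names, NACK arrivals are represented as `(arrival_time, echoed_source_port)` pairs, and the PSN is wrapped at 24 bits, which is the width of the PSN field in the RDMA transport header. How the real system builds and parses packets is not shown here.

```python
from dataclasses import dataclass

PSN_MODULUS = 1 << 24  # PSNs are 24-bit values


@dataclass(frozen=True)
class Probe:
    src_port: int  # UDP source port; changing it changes the ECMP hash input
    psn: int       # deliberately ahead of the PSN the receiver expects


def make_probes(expected_psn, base_port, count, psn_gap=1):
    """Build `count` probes with distinct source ports and an out-of-sequence PSN.

    `psn_gap` (how far ahead the probe PSN is) is an assumption of this sketch.
    """
    psn = (expected_psn + psn_gap) % PSN_MODULUS
    return [Probe(src_port=base_port + i, psn=psn) for i in range(count)]


def pick_path(probes, nack_arrivals):
    """Return the source port echoed by the earliest NACK, or None if no NACK
    matches one of our probes."""
    valid_ports = {p.src_port for p in probes}
    matching = [(t, port) for t, port in nack_arrivals if port in valid_ports]
    return min(matching)[1] if matching else None


probes = make_probes(expected_psn=100, base_port=50000, count=3)
# Simulated NACK arrivals in microseconds: the probe sent from port 50002 returns first.
nacks = [(12.5, 50001), (8.0, 50002), (20.0, 50000)]
best_port = pick_path(probes, nacks)
print(best_port)
```

Selecting on the first NACK keeps the sender-side logic simple: the sender does not need per-path latency measurements, only the port number echoed in the earliest reply.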

The second core module is Reorder Free Rerouting, RFR, which is the key to solving the packet reordering problem. Based on the path quality results fed back by PPP, RFR adopts two targeted strategies: Conservative Switching and Fast Recovery.

When PPP finds a high-quality uncongested alternative path, RFR executes the Conservative Switching strategy. We do not switch the path immediately. Instead, we wait until all in-flight packets on the original path have been acknowledged by the receiver. Only after the original path has completely drained do we migrate all subsequent data traffic to the new path. Because no old-path packet is still in flight when the first new-path packet is sent, packets cannot overtake each other, which guarantees in-order delivery and eliminates the reordering risk. Meanwhile, we appropriately raise the sending rate to fully utilize the bandwidth of the new idle path.

If the probing result shows that all alternative paths are also congested, or the original path is still the best choice, RFR will trigger the Fast Recovery strategy. We resume data transmission on the original path immediately, roll back the sending rate to the state before suspension, and stop all path probing behaviors. This can effectively prevent frequent path switching, avoid network oscillation, and ensure the stable operation of the whole network.
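The two RFR strategies can be summarised as a small state machine. The sketch below is a toy model, not the prototype's code: the class `Rerouter`, its method names and the `boost` factor for the post-switch rate increase are all hypothetical, and it assumes that sending is paused while probes are outstanding, as the phrase "before suspension" suggests.

```python
class Rerouter:
    """Toy model of Conservative Switching and Fast Recovery for one flow."""

    def __init__(self, current_port, rate, line_rate, boost=1.5):
        self.current_port = current_port  # source port that pins the current path
        self.rate = rate                  # current sending rate (e.g. Gbps)
        self.line_rate = line_rate        # upper bound for the sending rate
        self.boost = boost                # hypothetical rate increase after a switch
        self.in_flight = 0                # unacknowledged packets on the current path
        self.pending_port = None          # chosen alternative, waiting for the drain
        self.saved_rate = rate            # rate remembered at suspension
        self.paused = False

    def suspend(self):
        """Pause sending while probes are outstanding, remembering the rate."""
        self.saved_rate = self.rate
        self.rate = 0.0
        self.paused = True

    def on_probe_result(self, best_port):
        if best_port is None or best_port == self.current_port:
            # Fast Recovery: stay on the original path at the pre-suspension rate.
            self.pending_port = None
            self._resume(self.saved_rate)
        else:
            # Conservative Switching: remember the target, switch once drained.
            self.pending_port = best_port
            self._switch_if_drained()

    def on_ack(self, acked=1):
        self.in_flight = max(0, self.in_flight - acked)
        self._switch_if_drained()

    def _switch_if_drained(self):
        if self.pending_port is not None and self.in_flight == 0:
            self.current_port = self.pending_port
            self.pending_port = None
            self._resume(min(self.line_rate, self.saved_rate * self.boost))

    def _resume(self, rate):
        self.rate = rate
        self.paused = False


flow = Rerouter(current_port=50000, rate=40.0, line_rate=100.0)
flow.in_flight = 2                      # two packets still on the old path
flow.suspend()
flow.on_probe_result(50002)             # a better path exists, but no switch yet
port_while_draining = flow.current_port
flow.on_ack()
flow.on_ack()                           # old path drained: the switch happens here
print(port_while_draining, flow.current_port, flow.rate)
```

In this model the path changes only inside `_switch_if_drained`, so a packet on the new path can never be sent while an old-path packet is still unacknowledged.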

After finishing the design, we implemented a complete HP3 prototype based on DPDK, and also conducted large-scale simulation experiments based on NS-3.

The experimental results verify the advantages of HP3. In hardware micro-benchmark tests, when hash collisions cause link congestion, HP3 quickly completes path migration. After HP3 is enabled, the ECN marking ratio drops to zero, RTT jitter is eliminated and the RTT stabilizes at a low level, and the throughput of competing flows is restored to the physical line rate. The Flow Completion Time is reduced by nearly half, and transmission efficiency improves significantly.

In large-scale simulation experiments with general data center workloads, HP3 achieves the lowest average and 99th-percentile FCT slowdown under different traffic loads. Compared with existing state-of-the-art schemes, HP3 improves overall performance by up to 3.4 times. For AI training workloads with strong burstiness and synchronization, HP3 also adapts well. In the AllReduce and AllToAll scenarios, which are critical for large model training, HP3 noticeably reduces Job Completion Time, with an improvement of up to 1.3 times over the baseline schemes. At the same time, HP3 effectively suppresses packet reordering, avoids a large amount of redundant retransmission traffic, and improves the effective utilization of bandwidth.

Finally, on behalf of our whole team, I would like to thank all of you for listening. We welcome your comments and questions. Thank you very much!
