OCP 2025 Talk on Capacity-Aware Adaptive Routing

I recently spoke at OCP 2025 about Meta’s solution for adding remote-capacity awareness to adaptive routing, which helps improve network performance for AI training. Here is a brief overview of the talk:

As backend networks continue to evolve and grow exponentially, Meta has been working on optimizing the routing design for its next-generation backend network architecture. Our current Adaptive Routing Scheme, which relies on local load awareness, has limitations when applied to our larger topologies. Parallel links and shallow-buffer switches can lead to congestion and decreased performance, particularly when remote link failures reduce capacity along a path without removing reachability, so the local switch still treats that path as a full-capacity option.

To address these challenges, we have developed a new approach that combines remote capacity information with local-congestion-aware spray. This solution dynamically adjusts the load-balancing scheme based on end-to-end network capacity, distributing traffic more efficiently and minimizing the impact of link failures on network performance. To address scaling challenges, we also developed algorithms for efficiently merging ARS groups and leveraged non-load-aware ECMP spray groups.
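To make the idea concrete, here is a minimal Python sketch of capacity-weighted spraying. It is an illustration under my own assumptions, not the algorithm from the talk: the `NextHop` type, the `remote_capacity` and `local_queue_bytes` fields, and the scoring rule (local queue depth divided by remote capacity) are all hypothetical. Queues are never drained here, so the packet counts simply show how the split follows capacity.

```python
from dataclasses import dataclass


@dataclass
class NextHop:
    name: str
    # Hypothetical input: usable capacity toward the destination beyond this
    # hop (for example in Gbps), as learned from remote switches.
    remote_capacity: float
    # Hypothetical local congestion signal: bytes queued on the egress port.
    local_queue_bytes: int = 0


def pick_next_hop(hops):
    """Pick the hop with the lowest local load relative to remote capacity."""
    usable = [h for h in hops if h.remote_capacity > 0]
    if not usable:
        raise ValueError("no next hop with remaining capacity")
    # The +1 breaks the all-queues-empty tie in favor of higher capacity.
    return min(usable, key=lambda h: (h.local_queue_bytes + 1) / h.remote_capacity)


def spray(hops, packets, packet_bytes=4096):
    """Spray packets one at a time and count how many each hop receives."""
    counts = {h.name: 0 for h in hops}
    for _ in range(packets):
        hop = pick_next_hop(hops)
        hop.local_queue_bytes += packet_bytes  # no draining in this toy model
        counts[hop.name] += 1
    return counts


if __name__ == "__main__":
    # Hop C has lost half of its remote links but is still reachable.
    hops = [NextHop("A", 400.0), NextHop("B", 400.0), NextHop("C", 200.0)]
    print(spray(hops, 300))
```

In this toy model, a spray driven only by local queue depth would split traffic evenly across A, B, and C, because all three look identical locally. Weighting by remote capacity steers roughly half as much traffic toward the degraded hop C.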

A recording of the full talk is available on YouTube.