Deadlock-Free Hybrid RL-MAPF Framework for Zero-Shot Multi-Robot Navigation

Overview

We study multi-robot navigation in cluttered environments where purely reactive policies often stall at bottlenecks such as doorways and single-lane corridors. Our framework keeps a decentralized RL navigator as the default controller and only hands over a small, locally selected group of agents to a MAPF solver when a deadlock is detected. After resolving the stalemate on a cropped grid with Push-and-Rotate (PnR), agents return to the RL mode and continue tracking their global waypoints.

Figure: RL+MAPF overview.

Abstract

Multi-robot navigation in cluttered environments must strike a balance between short-horizon collision avoidance and reliable long-range goal completion. In narrow passages and confined spaces, teams controlled by reactive policies (including RL-based ones) are prone to topological deadlocks: groups of agents engage in reciprocal avoidance but make no net progress toward their goals.

We propose a hybrid framework that couples a decentralized RL navigator with an on-demand Multi-Agent Path Finding (MAPF) module. A lightweight safety layer monitors local kinematics and waypoint progress to detect persistent standstills. When a deadlock is identified, we crop a small grid around the implicated agents, instantiate a locally confined MAPF instance, and invoke Push-and-Rotate to generate short, dense waypoint lists for those agents only. The same RL policy then tracks these dense waypoints while handling local interactions; non-participating agents remain in the RL mode.

Across dense doorway and corridor benchmarks, the deadlock-triggered hybrid reliably turns stalled interactions into successful completions, raising team success rates from marginal to near-universal while keeping coordination overhead localized. The framework can be plugged into existing RL-based navigators with minimal changes and extends naturally to hierarchical task-planning pipelines for heterogeneous robot teams.

Method

Hybrid RL + MAPF Workflow

Our controller follows a layered design. By default, each agent runs a decentralized RL policy whose state and reward design is inspired by reciprocal velocity obstacles (RVO): the observation encodes the VO/RVO geometry relative to nearby agents and obstacles, and a BiGRU-based encoder aggregates the variable-sized neighbor set into a compact state that is fed to actor–critic heads trained with PPO.
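To make the VO/RVO part of the observation concrete, here is a minimal Python sketch of the standard reciprocal-velocity-obstacle geometry for one disc-shaped neighbor. It assumes holonomic disc agents; the function names, the 7-value feature layout, and the overlap handling are illustrative assumptions, not the authors' observation encoding, and the BiGRU aggregation over neighbors is not shown.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def rvo_cone(p_a: Vec, v_a: Vec, r_a: float,
             p_b: Vec, v_b: Vec, r_b: float):
    """Return (apex, center_dir, half_angle, gap) of the RVO that B induces on A.

    apex: RVO apex (v_a + v_b) / 2 (the plain VO apex would be v_b).
    center_dir: unit vector from A toward B.
    half_angle: cone half-angle asin((r_a + r_b) / |p_b - p_a|).
    gap: surface-to-surface distance |p_b - p_a| - (r_a + r_b).
    """
    dx, dy = p_b[0] - p_a[0], p_b[1] - p_a[1]
    dist = math.hypot(dx, dy)
    r_sum = r_a + r_b
    if dist <= r_sum:
        raise ValueError("discs overlap; the cone is undefined")
    apex = ((v_a[0] + v_b[0]) / 2.0, (v_a[1] + v_b[1]) / 2.0)
    center_dir = (dx / dist, dy / dist)
    half_angle = math.asin(r_sum / dist)
    return apex, center_dir, half_angle, dist - r_sum


def in_rvo(v: Vec, apex: Vec, center_dir: Vec, half_angle: float) -> bool:
    """True if candidate velocity v lies strictly inside the RVO cone."""
    rx, ry = v[0] - apex[0], v[1] - apex[1]
    if math.hypot(rx, ry) < 1e-12:
        return False  # exactly at the apex
    cross = center_dir[0] * ry - center_dir[1] * rx
    dot = center_dir[0] * rx + center_dir[1] * ry
    return abs(math.atan2(cross, dot)) < half_angle


def neighbor_feature(p_a: Vec, v_a: Vec, r_a: float,
                     p_b: Vec, v_b: Vec, r_b: float) -> list:
    """Flatten one neighbor's RVO geometry: apex, left/right boundary directions, gap."""
    apex, c, half, gap = rvo_cone(p_a, v_a, r_a, p_b, v_b, r_b)
    theta = math.atan2(c[1], c[0])
    left = (math.cos(theta + half), math.sin(theta + half))
    right = (math.cos(theta - half), math.sin(theta - half))
    return [apex[0], apex[1], left[0], left[1], right[0], right[1], gap]
```

A variable number of such per-neighbor vectors is exactly the kind of input a recurrent set encoder (such as the BiGRU mentioned above) would consume.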

For long-range guidance, a grid-based planner (e.g., A*) generates sparse waypoint lists that serve as global references. The RL policy tracks these waypoints while performing reciprocal collision avoidance. This pattern of global waypoints plus local RL works well in open regions but can stall when agents must be reordered topologically (e.g., agents exchanging sides through a doorway, or two-way traffic in a single-lane corridor).
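A minimal sketch of the global layer, assuming a 4-connected occupancy grid and unit step costs: A* returns a cell path, and a hypothetical `sparsify` helper keeps only the endpoints and turning points as sparse waypoints. How the original waypoint lists are thinned is not stated, so the turning-point rule is an assumption.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def astar(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """4-connected A* on an occupancy grid (1 = obstacle); returns a cell path or None."""
    rows, cols = len(grid), len(grid[0])

    def h(c: Cell) -> int:
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])

    g: Dict[Cell, int] = {start: 0}
    parent: Dict[Cell, Optional[Cell]] = {start: None}
    heap = [(h(start), 0, start)]
    while heap:
        _, g_cur, cur = heapq.heappop(heap)
        if cur == goal:
            path = []
            node: Optional[Cell] = cur
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        if g_cur > g[cur]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or grid[nxt[0]][nxt[1]]:
                continue
            g_new = g_cur + 1
            if g_new < g.get(nxt, float("inf")):
                g[nxt] = g_new
                parent[nxt] = cur
                heapq.heappush(heap, (g_new + h(nxt), g_new, nxt))
    return None


def sparsify(path: List[Cell]) -> List[Cell]:
    """Keep the endpoints and every cell where the motion direction changes."""
    if len(path) <= 2:
        return list(path)
    waypoints = [path[0]]
    for prev, cur, nxt in zip(path, path[1:], path[2:]):
        if (cur[0] - prev[0], cur[1] - prev[1]) != (nxt[0] - cur[0], nxt[1] - cur[1]):
            waypoints.append(cur)
    waypoints.append(path[-1])
    return waypoints
```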

To handle such cases, we attach a deadlock detector and a locally confined MAPF module as a wrapper around the base RL navigator, without modifying the low-level policy itself.
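The wrapper idea can be sketched as follows. `HybridWrapper`, `Agent`, and the three injected callables (`policy`, `detect`, `solve_local`) are hypothetical names; the point illustrated is that the low-level policy is called the same way in both modes and only the waypoint it is asked to track changes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Vec = Tuple[float, float]


@dataclass
class Agent:
    pos: Vec
    global_wps: List[Vec]  # sparse global reference, e.g. from A*
    dense_wps: List[Vec] = field(default_factory=list)  # non-empty only during MAPF handover
    gidx: int = 0  # index of the active global waypoint

    def active_target(self) -> Vec:
        # Dense MAPF waypoints, when present, take priority over the global reference.
        return self.dense_wps[0] if self.dense_wps else self.global_wps[self.gidx]


class HybridWrapper:
    """Hypothetical wrapper: the RL policy is never modified, only the target it tracks."""

    def __init__(self,
                 policy: Callable[[Agent, Vec], Vec],
                 detect: Callable[[Dict[int, Agent]], Set[int]],
                 solve_local: Callable[[Set[int], Dict[int, Agent]], Dict[int, List[Vec]]]):
        self.policy = policy            # unchanged decentralized RL navigator
        self.detect = detect            # returns the participant group (empty if no deadlock)
        self.solve_local = solve_local  # locally confined MAPF, returns dense waypoints

    def step(self, agents: Dict[int, Agent]) -> Dict[int, Vec]:
        group = self.detect(agents)
        # Escalate only if no group member is already executing a MAPF segment.
        if group and not any(agents[i].dense_wps for i in group):
            for i, wps in self.solve_local(group, agents).items():
                agents[i].dense_wps = list(wps)
        # Every agent, coordinating or not, is driven by the same policy call.
        return {i: self.policy(a, a.active_target()) for i, a in agents.items()}
```

Injecting the detector and solver as callables keeps the base navigator untouched, in line with the plug-in design described above; how dense waypoints are consumed and handed back is deliberately left out of this sketch.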

Figure: Hybrid RL+MAPF workflow (detect → crop → coordinate → handover).

Deadlock Detection

The deadlock detector monitors local motion statistics and reciprocal interactions. It combines three complementary triggers:

Speed / non-progress trigger: flags mutual slow-down with negligible goal-directed motion among neighboring agents.
Waypoint-stuck trigger: detects long-horizon stagnation when an agent’s active waypoint index has not advanced for a prolonged window.
Core-pair risk trigger: uses short-horizon time-to-collision (TTC) and minimum-distance proxies to identify mutually “most-at-risk” pairs, which seed the minimal coordination set.
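The three triggers can be sketched with simple per-step statistics. Everything here is an assumption for illustration: the thresholds (`v_min`, `d_min`, `patience`, `horizon`), the constant-velocity disc TTC model, and the rule that a core pair consists of two agents that are each other's minimum-TTC neighbor. The gating the detector applies on top of these signals is not reproduced.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vec = Tuple[float, float]


def time_to_collision(p_i: Vec, v_i: Vec, p_j: Vec, v_j: Vec, r_sum: float) -> float:
    """Constant-velocity TTC of two discs: 0 if already touching, inf if never."""
    dpx, dpy = p_j[0] - p_i[0], p_j[1] - p_i[1]
    dvx, dvy = v_j[0] - v_i[0], v_j[1] - v_i[1]
    c = dpx * dpx + dpy * dpy - r_sum * r_sum
    if c <= 0.0:
        return 0.0
    a = dvx * dvx + dvy * dvy
    b = 2.0 * (dpx * dvx + dpy * dvy)
    disc = b * b - 4.0 * a * c
    if a < 1e-12 or disc < 0.0:
        return math.inf
    t = (-b - math.sqrt(disc)) / (2.0 * a)  # earliest root of |dp + dv t| = r_sum
    return t if t >= 0.0 else math.inf


def speed_trigger(speeds: Sequence[float], goal_progress: Sequence[float],
                  v_min: float, d_min: float) -> bool:
    """Speed / non-progress: all neighbors slow AND negligible goal progress over the window."""
    return all(s < v_min for s in speeds) and all(d < d_min for d in goal_progress)


class WaypointStuck:
    """Waypoint-stuck: counts consecutive steps an agent's waypoint index has not advanced."""

    def __init__(self, patience: int):
        self.patience = patience
        self.last_idx: Dict[int, int] = {}
        self.count: Dict[int, int] = {}

    def update(self, agent: int, wp_idx: int) -> bool:
        if self.last_idx.get(agent) == wp_idx:
            self.count[agent] = self.count.get(agent, 0) + 1
        else:
            self.last_idx[agent] = wp_idx
            self.count[agent] = 0
        return self.count[agent] >= self.patience


def core_pairs(pos: Dict[int, Vec], vel: Dict[int, Vec], r_sum: float,
               horizon: float) -> List[Tuple[int, int]]:
    """Core-pair risk: pairs that are each other's minimum-TTC neighbor within the horizon."""
    worst: Dict[int, Optional[int]] = {}
    for i in pos:
        best, best_t = None, horizon
        for j in pos:
            if j == i:
                continue
            t = time_to_collision(pos[i], vel[i], pos[j], vel[j], r_sum)
            if t < best_t:
                best, best_t = j, t
        worst[i] = best
    return [(i, j) for i, j in worst.items()
            if j is not None and i < j and worst.get(j) == i]
```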

A deadlock is declared when any trigger fires and passes its gating conditions. Starting from a seed agent, we form a small participant group by adding its most-at-risk counterpart and a small local closure of neighboring agents that are also involved in the stall. Only this group is escalated to MAPF; all other agents remain in the RL mode.
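One way to build the participant group is a capped, nearest-first breadth-first expansion from the core pair over nearby stalled agents. The `radius`, `max_size`, and `stalled` inputs are hypothetical; the text only specifies that the closure is small and local.

```python
import math
from collections import deque
from typing import Dict, List, Set, Tuple

Vec = Tuple[float, float]


def form_group(seed: int, partner: int, pos: Dict[int, Vec], stalled: Set[int],
               radius: float, max_size: int) -> List[int]:
    """Core pair plus a capped, nearest-first closure of nearby stalled agents."""
    group = [seed, partner]
    members = set(group)
    frontier = deque(group)
    while frontier and len(group) < max_size:
        u = frontier.popleft()
        # Stalled agents within `radius` of the current member, nearest first (ties by id).
        near = sorted(
            (k for k in stalled
             if k not in members and math.dist(pos[u], pos[k]) <= radius),
            key=lambda k: (math.dist(pos[u], pos[k]), k),
        )
        for k in near:
            if len(group) >= max_size:
                break
            group.append(k)
            members.add(k)
            frontier.append(k)  # expand from newly added agents too (transitive closure)
    return group
```

The size cap keeps the MAPF instance small even in a dense crowd, at the risk of leaving a stalled agent out; that trade-off is a choice of this sketch.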

Locally Confined MAPF with Push-and-Rotate

Once a deadlock is detected, we crop a subgrid around the participant group, project their current poses and active global waypoints to grid cells, and form a discrete MAPF instance. This instance is solved using Push-and-Rotate (PnR), a complete, polynomial-time MAPF algorithm under mild blank-space assumptions. The joint discrete plan is then converted into short dense waypoint lists for each participating agent.
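The bookkeeping around the solver can be sketched as below: crop a padded window around the participants, map world positions to cells, and turn a joint plan back into per-agent dense waypoints at cell centers. Push-and-Rotate itself is not reproduced; `joint_plan` stands in for its output as one `{agent: local cell}` dict per timestep, and the grid resolution, margin, and axis convention (row from y, column from x) are assumptions.

```python
import math
from typing import Dict, List, Tuple

Vec = Tuple[float, float]
Cell = Tuple[int, int]
Window = Tuple[int, int, int, int]  # (r0, c0, r1, c1), inclusive


def crop_window(cells: List[Cell], margin: int, rows: int, cols: int) -> Window:
    """Bounding box of the given cells, padded by `margin` and clipped to the map."""
    r0 = max(min(r for r, _ in cells) - margin, 0)
    c0 = max(min(c for _, c in cells) - margin, 0)
    r1 = min(max(r for r, _ in cells) + margin, rows - 1)
    c1 = min(max(c for _, c in cells) + margin, cols - 1)
    return r0, c0, r1, c1


def to_cell(p: Vec, res: float) -> Cell:
    """World (x, y) -> grid (row, col) with row = floor(y / res), col = floor(x / res)."""
    return int(math.floor(p[1] / res)), int(math.floor(p[0] / res))


def to_world(cell: Cell, res: float) -> Vec:
    """Grid (row, col) -> world coordinates of the cell center."""
    return ((cell[1] + 0.5) * res, (cell[0] + 0.5) * res)


def crop(grid: List[List[int]], window: Window) -> List[List[int]]:
    """Extract the sub-grid that defines the local MAPF instance."""
    r0, c0, r1, c1 = window
    return [row[c0:c1 + 1] for row in grid[r0:r1 + 1]]


def plan_to_dense_waypoints(joint_plan: List[Dict[int, Cell]], window: Window,
                            res: float) -> Dict[int, List[Vec]]:
    """Per-agent dense waypoints from a joint plan; consecutive repeats (waits) collapse."""
    r0, c0, _, _ = window
    dense: Dict[int, List[Vec]] = {}
    last: Dict[int, Cell] = {}
    for step in joint_plan:
        for agent, (lr, lc) in step.items():
            g = (lr + r0, lc + c0)  # local cell -> global cell
            if last.get(agent) != g:
                dense.setdefault(agent, []).append(to_world(g, res))
                last[agent] = g
    return dense
```

Collapsing consecutive repeated cells drops explicit wait actions; this is only reasonable if the tracker follows waypoints in order rather than on a clock, which is an assumption of the sketch.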

During the coordination phase, the same RL policy tracks these dense waypoints under a mild speed cap, providing robustness to model mismatch (e.g., between the discrete grid plan and the continuous robot motion). Once the dense segments are executed and the stalemate is cleared, agents switch back to default RL navigation and resume following their original global waypoint lists.
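A minimal per-agent sketch of the coordination phase and handover, assuming an RL policy callable that maps (position, target) to a velocity command. The `Tracker` class, the arrival tolerance `tol`, and the cap `coord_v_max` are hypothetical; the speed cap is applied only while dense waypoints remain, and the agent falls back to its active global waypoint as soon as the dense list is exhausted.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Vec = Tuple[float, float]


def cap_speed(v: Vec, v_max: float) -> Vec:
    """Scale v down (direction preserved) if its norm exceeds v_max."""
    n = math.hypot(v[0], v[1])
    if n <= v_max or n == 0.0:
        return v
    s = v_max / n
    return (v[0] * s, v[1] * s)


@dataclass
class Tracker:
    global_wps: List[Vec]
    gidx: int = 0
    dense_wps: List[Vec] = field(default_factory=list)
    tol: float = 0.2          # hypothetical arrival tolerance [m]
    coord_v_max: float = 0.5  # hypothetical speed cap during coordination [m/s]

    def step(self, pos: Vec, rl_policy: Callable[[Vec, Vec], Vec]) -> Vec:
        # Drop dense waypoints that have already been reached.
        while self.dense_wps and math.dist(pos, self.dense_wps[0]) < self.tol:
            self.dense_wps.pop(0)
        # Global waypoints advance only in RL mode; the last one is the goal.
        if not self.dense_wps:
            while (self.gidx < len(self.global_wps) - 1
                   and math.dist(pos, self.global_wps[self.gidx]) < self.tol):
                self.gidx += 1
        coordinating = bool(self.dense_wps)
        target = self.dense_wps[0] if coordinating else self.global_wps[self.gidx]
        v = rl_policy(pos, target)  # the same RL policy in both modes
        return cap_speed(v, self.coord_v_max) if coordinating else v
```

Freezing the global index while coordinating means the agent resumes from whichever global waypoint was active at handover; whether the original method re-projects onto the global list instead is not stated.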

Results

We evaluate the framework in two challenging static 2D scenarios designed to stress long-range progress under bottlenecks: a doorway environment with a short bottleneck and a corridor environment with a long single-lane passage. In both cases, agents start on opposite sides with symmetric start–goal pairs, creating dense bidirectional traffic.
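For reproducing this kind of setup, a hypothetical generator for symmetric start–goal pairs might place half the team on each side of the bottleneck and mirror each start across the vertical axis through the bottleneck, so every agent must cross it. The layout parameters and the left/right mirroring are assumptions; the actual benchmark coordinates are not given here.

```python
from typing import List, Tuple

Vec = Tuple[float, float]


def symmetric_pairs(n_agents: int, x_side: float, y_span: float) -> List[Tuple[Vec, Vec]]:
    """Hypothetical layout: agents start at x = -x_side or x = +x_side, and each goal is
    the start mirrored across x = 0, producing bidirectional traffic through the middle."""
    if n_agents % 2:
        raise ValueError("n_agents must be even for a symmetric layout")
    per_side = n_agents // 2
    # Evenly spaced lanes across [-y_span / 2, y_span / 2].
    ys = [-y_span / 2 + y_span * (k + 0.5) / per_side for k in range(per_side)]
    pairs: List[Tuple[Vec, Vec]] = []
    for y in ys:
        pairs.append(((-x_side, y), (x_side, y)))  # left -> right
        pairs.append(((x_side, y), (-x_side, y)))  # right -> left
    return pairs
```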

Across a range of team sizes, the hybrid RL+MAPF controller dramatically improves success rates compared to the RL-only baseline. The GIFs below visualize typical runs in both scenes, contrasting a purely RL-based navigator (often stuck in deadlock) with the proposed hybrid controller.

Doorway Scenario

Hybrid RL+MAPF in the doorway scenario. When agents stall at the bottleneck, a small subset is lifted to a local MAPF instance; after executing the PnR-generated dense waypoints, all agents clear the doorway and reach their goals (Status: SUCCESS).

Figure: Doorway scenario, RL-only policy falling into a deadlock.

RL-only baseline in the same doorway scenario. Agents perform reciprocal avoidance but fail to break the deadlock at the bottleneck and eventually time out (Status: FAIL).

Corridor Scenario

Hybrid RL+MAPF in the corridor scenario with two-way single-lane traffic. Occasional local MAPF calls reorder a handful of agents, restoring flow and enabling almost all agents to reach their goals (Status: SUCCESS).

Figure: Corridor scenario, RL-only policy stuck in a long-lived deadlock.

RL-only baseline in the corridor scenario. Without explicit coordination, agents form persistent head-on conflicts in the single-lane passage and cannot resolve the deadlock before the time limit (Status: FAIL).

BibTeX

If you find this work useful, please cite:

@article{wang2025deadlockfree,
  title   = {Deadlock-Free Hybrid RL-MAPF Framework for Zero-Shot Multi-Robot Navigation},
  author  = {Wang, Haoyi and Luo, Licheng and Kantaros, Yiannis and Sinopoli, Bruno and Cai, Mingyu},
  journal = {arXiv preprint arXiv:2511.22685},
  year    = {2025},
  doi     = {10.48550/arXiv.2511.22685}
}