Robotics 53
☆ SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model
AI agents built on large language models (LLMs) hold enormous promise, but
current practice focuses on a one-task-one-agent approach, which not only falls
short in scalability and generality but also suffers from the fundamental
limitations of autoregressive LLMs. On the other hand, humans are general
agents who reason by mentally simulating the outcomes of their actions and
plans. Moving towards a more general and powerful AI agent, we introduce
SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based
on a principled formulation of an optimal agent in any environment, SimuRA
overcomes the limitations of autoregressive reasoning by introducing a world
model for planning via simulation. The generalized world model is implemented
using an LLM, which can flexibly plan in a wide range of environments using the
concept-rich latent space of natural language. Experiments on difficult web
browsing tasks show that SimuRA improves the success rate of flight search from
0% to 32.2%. World-model-based planning, in particular, shows a consistent
advantage of up to 124% over autoregressive planning, demonstrating the value
of world-model simulation as a reasoning paradigm. We are excited about the
possibility of training a single, general agent model based on LLMs that can
act superintelligently in all environments. To start, we make a web-browsing
agent built on SimuRA with pretrained LLMs available as a research demo for
public testing.
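The core loop of simulative reasoning is simple to state: propose candidate actions, simulate each outcome with the world model, and commit to the best one. Below is a minimal, hypothetical Python sketch of that loop; the `llm` helper, the prompts, and the scoring scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of world-model-based simulative planning. `llm` is a
# hypothetical text-completion function standing in for any LLM API.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM completion API here")

def propose_actions(state: str, goal: str, k: int = 3) -> list[str]:
    # Policy step: ask the LLM for k candidate actions in natural language.
    text = llm(f"Goal: {goal}\nState: {state}\nList {k} candidate actions:")
    return [line.strip("- ") for line in text.splitlines() if line.strip()][:k]

def simulate(state: str, action: str) -> str:
    # World-model step: predict the next state in the latent space of language.
    return llm(f"State: {state}\nAction: {action}\nPredicted next state:")

def evaluate(state: str, goal: str) -> float:
    # Critic step: score how close a predicted state is to the goal (0-1).
    return float(llm(f"Goal: {goal}\nState: {state}\nScore from 0 to 1:"))

def plan_one_step(state: str, goal: str) -> str:
    # Pick the action whose *simulated* outcome scores best, rather than
    # autoregressively committing to the first action the LLM emits.
    candidates = propose_actions(state, goal)
    return max(candidates, key=lambda a: evaluate(simulate(state, a), goal))
```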
☆ Distributed AI Agents for Cognitive Underwater Robot Autonomy
Achieving robust cognitive autonomy in robots navigating complex,
unpredictable environments remains a fundamental challenge in robotics. This
paper presents Underwater Robot Self-Organizing Autonomy (UROSA), a
groundbreaking architecture leveraging distributed Large Language Model AI
agents integrated within the Robot Operating System 2 (ROS 2) framework to
enable advanced cognitive capabilities in Autonomous Underwater Vehicles. UROSA
decentralises cognition into specialised AI agents responsible for multimodal
perception, adaptive reasoning, dynamic mission planning, and real-time
decision-making. Central innovations include flexible agents dynamically
adapting their roles, retrieval-augmented generation utilising vector databases
for efficient knowledge management, reinforcement learning-driven behavioural
optimisation, and autonomous on-the-fly ROS 2 node generation for runtime
functional extensibility. Extensive empirical validation demonstrates UROSA's
promising adaptability and reliability through realistic underwater missions in
simulation and real-world deployments, showing significant advantages over
traditional rule-based architectures in handling unforeseen scenarios,
environmental uncertainties, and novel mission objectives. This work not only
advances underwater autonomy but also establishes a scalable, safe, and
versatile cognitive robotics framework capable of generalising to a diverse
array of real-world applications.
☆ RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping ICCV 2025
Dongming Wu, Yanping Fu, Saike Huang, Yingfei Liu, Fan Jia, Nian Liu, Feng Dai, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, Jianbing Shen
General robotic grasping systems require accurate object affordance
perception in diverse open-world scenarios following human instructions.
However, current studies lack large-scale, reasoning-based affordance
prediction data, raising considerable concern about open-world effectiveness.
To address this limitation, we build a large-scale
grasping-oriented affordance segmentation benchmark with human-like
instructions, named RAGNet. It contains 273k images, 180 categories, and 26k
reasoning instructions. The images cover diverse embodied data domains, such as
wild, robot, ego-centric, and even simulation data. Each image is carefully
annotated with an affordance map, and the difficulty of the language
instructions is substantially increased by removing category names and
providing only functional descriptions. Furthermore, we propose a comprehensive
affordance-based grasping framework, named AffordanceNet, which consists of a
VLM pre-trained on our massive affordance data and a grasping network
conditioned on the affordance map to grasp the target. Extensive experiments on
affordance segmentation benchmarks and real-robot manipulation tasks show that
our model has powerful open-world generalization ability. Our data and code
are available at https://github.com/wudongming97/AffordanceNet.
comment: Accepted by ICCV 2025. The code is at
https://github.com/wudongming97/AffordanceNet
☆ Design of a bioinspired robophysical antenna for insect-scale tactile perception and navigation
Parker McDonnell, Lingsheng Meng, Hari Krishna Hariprasad, Alexander Hedrick, Eduardo Miscles, Samuel Gilinsky, Jean-Michel Mongeau, Kaushik Jayaram
The American cockroach (Periplaneta americana) uses its soft antennae to
guide decision making by extracting rich tactile information from tens of
thousands of distributed mechanosensors. Although tactile sensors enable
robust, autonomous perception and navigation in natural systems, replicating
these capabilities in insect-scale robots remains challenging due to stringent
size, weight, and power constraints that limit existing sensor technologies. To
overcome these limitations, we introduce CITRAS (Cockroach Inspired Tactile
Robotic Antenna Sensor), a bioinspired, multi-segmented, compliant laminate
sensor with embedded capacitive angle sensors. CITRAS is compact (73.7x15.6x2.1
mm), lightweight (491 mg), and low-power (32 mW), enabling seamless integration
with miniature robotic platforms. The segmented compliant structure passively
bends in response to environmental stimuli, achieving accurate hinge angle
measurements with maximum errors of just 0.79 degrees (quasistatic bending) and
3.58 degrees (dynamic bending). Experimental evaluations demonstrate CITRAS'
multifunctional tactile perception capabilities: predicting base-to-tip
distances with 7.75% error, estimating environmental gap widths with 6.73%
error, and distinguishing surface textures through differential sensor
response. The future integration of this bioinspired tactile antenna in
insect-scale robots addresses critical sensing gaps, promising enhanced
autonomous exploration, obstacle avoidance, and environmental mapping in
complex, confined environments.
☆ Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents
While Reinforcement Learning (RL) has achieved remarkable success in language
modeling, that success has not yet fully translated to visuomotor agents. A
primary challenge in RL models is their tendency to overfit specific tasks or
environments, thereby hindering the acquisition of generalizable behaviors
across diverse settings. This paper provides a preliminary answer to this
challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can
achieve zero-shot generalization to unseen worlds. Specifically, we explore
RL's potential to enhance generalizable spatial reasoning and interaction
capabilities in 3D worlds. To address challenges in multi-task RL
representation, we analyze and establish cross-view goal specification as a
unified multi-task goal space for visuomotor policies. Furthermore, to overcome
the significant bottleneck of manual task design, we propose automated task
synthesis within the highly customizable Minecraft environment for large-scale
multi-task RL training, and we construct an efficient distributed RL framework
to support this. Experimental results show RL significantly boosts interaction
success rates by $4\times$ and enables zero-shot generalization of spatial
reasoning across diverse environments, including real-world settings. Our
findings underscore the immense potential of RL training in 3D simulated
environments, especially those amenable to large-scale task generation, for
significantly advancing visuomotor agents' spatial reasoning.
☆ villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, Jianyu Chen, Jiang Bian
Vision-Language-Action (VLA) models have emerged as a popular paradigm for
learning robot manipulation policies that can follow language instructions and
generalize to novel scenarios. Recent work has begun to explore the
incorporation of latent actions, an abstract representation of visual change
between two frames, into VLA pre-training. In this paper, we introduce villa-X,
a novel Vision-Language-Latent-Action (ViLLA) framework that advances latent
action modeling for learning generalizable robot manipulation policies. Our
approach improves both how latent actions are learned and how they are
incorporated into VLA pre-training. Together, these contributions enable
villa-X to achieve superior performance across simulated environments including
SIMPLER and LIBERO, as well as on two real-world robot setups including gripper
and dexterous hand manipulation. We believe the ViLLA paradigm holds
significant promise, and that our villa-X provides a strong foundation for
future research.
comment: Project page: https://aka.ms/villa-x
☆ Stereo 3D Gaussian Splatting SLAM for Outdoor Urban Scenes
3D Gaussian Splatting (3DGS) has recently gained popularity in SLAM
applications due to its fast rendering and high-fidelity representation.
However, existing 3DGS-SLAM systems have predominantly focused on indoor
environments and relied on active depth sensors, leaving a gap for large-scale
outdoor applications. We present BGS-SLAM, the first binocular 3D Gaussian
Splatting SLAM system designed for outdoor scenarios. Our approach uses only
RGB stereo pairs without requiring LiDAR or active sensors. BGS-SLAM leverages
depth estimates from pre-trained deep stereo networks to guide 3D Gaussian
optimization with a multi-loss strategy enhancing both geometric consistency
and visual quality. Experiments on multiple datasets demonstrate that BGS-SLAM
achieves superior tracking accuracy and mapping performance compared to other
3DGS-based solutions in complex outdoor environments.
★ DuLoc: Life-Long Dual-Layer Localization in Changing and Dynamic Expansive Scenarios
LiDAR-based localization serves as a critical component in autonomous
systems, yet existing approaches face persistent challenges in balancing
repeatability, accuracy, and environmental adaptability. Traditional point
cloud registration methods relying solely on offline maps often exhibit limited
robustness against long-term environmental changes, leading to localization
drift and reliability degradation in dynamic real-world scenarios. To address
these challenges, this paper proposes DuLoc, a robust and accurate localization
method that tightly couples LiDAR-inertial odometry with offline map-based
localization, incorporating a constant-velocity motion model to mitigate
outlier noise in real-world scenarios. Specifically, we develop a LiDAR-based
localization framework that seamlessly integrates a prior global map with
dynamic real-time local maps, enabling robust localization in unbounded and
changing environments. Extensive real-world experiments in a vast, unbounded
port environment, involving 2,856 hours of operational data across 32
Intelligent Guided Vehicles (IGVs), are conducted and reported in this study.
The results attained
demonstrate that our system outperforms other state-of-the-art LiDAR
localization systems in large-scale changing outdoor environments.
☆ DRACo-SLAM2: Distributed Robust Acoustic Communication-efficient SLAM for Imaging Sonar Equipped Underwater Robot Teams with Object Graph Matching
We present DRACo-SLAM2, a distributed SLAM framework for underwater robot
teams equipped with multibeam imaging sonar. This framework improves upon the
original DRACo-SLAM by introducing a novel representation of sonar maps as
object graphs and utilizing object graph matching to achieve time-efficient
inter-robot loop closure detection without relying on prior geometric
information. To better accommodate the needs and characteristics of underwater
scan matching, we propose incremental Group-wise Consistent Measurement Set
Maximization (GCM), a modification of Pairwise Consistent Measurement Set
Maximization (PCM), which effectively handles scenarios where nearby
inter-robot loop closures share similar registration errors. The proposed
approach is validated through extensive comparative analyses on simulated and
real-world datasets.
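For intuition, consistency-based outlier rejection in the PCM family can be viewed as selecting a mutually consistent subset of candidate loop closures from a pairwise consistency graph. The toy Python sketch below shows a greedy version of that selection under assumed inputs; it illustrates the general PCM idea, not the paper's incremental, group-wise GCM algorithm.

```python
import numpy as np

# Toy sketch of pairwise-consistency selection in the spirit of PCM. The
# incremental, group-wise GCM variant additionally groups nearby inter-robot
# loop closures with similar registration errors, which is omitted here.
def consistent_subset(pairwise_ok: np.ndarray) -> list[int]:
    """Greedily grow a mutually consistent set from a boolean consistency
    matrix (pairwise_ok[i, j] is True if loop closures i and j agree)."""
    order = np.argsort(-pairwise_ok.sum(axis=1))  # try high-degree nodes first
    chosen: list[int] = []
    for i in order:
        if all(pairwise_ok[i, j] for j in chosen):
            chosen.append(int(i))
    return chosen

# Example: 4 candidate inter-robot loop closures, one outlier (index 3).
ok = np.array([[1, 1, 1, 0],
               [1, 1, 1, 0],
               [1, 1, 1, 0],
               [0, 0, 0, 1]], dtype=bool)
print(consistent_subset(ok))  # -> [0, 1, 2]
```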
☆ Human-Exoskeleton Kinematic Calibration to Improve Hand Tracking for Dexterous Teleoperation
Haiyun Zhang, Stefano Dalla Gasperina, Saad N. Yousaf, Toshimitsu Tsuboi, Tetsuya Narita, Ashish D. Deshpande
Hand exoskeletons are critical tools for dexterous teleoperation and
immersive manipulation interfaces, but achieving accurate hand tracking remains
a challenge due to user-specific anatomical variability and donning
inconsistencies. These issues lead to kinematic misalignments that degrade
tracking performance and limit applicability in precision tasks. We propose a
subject-specific calibration framework for exoskeleton-based hand tracking that
uses redundant joint sensing and a residual-weighted optimization strategy to
estimate virtual link parameters. Implemented on the Maestro exoskeleton, our
method improves joint angle and fingertip position estimation across users with
varying hand geometries. We introduce a data-driven approach to empirically
tune cost function weights using motion capture ground truth, enabling more
accurate and consistent calibration across participants. Quantitative results
from seven subjects show substantial reductions in joint and fingertip tracking
errors compared to uncalibrated and evenly weighted models. Qualitative
visualizations using a Unity-based virtual hand further confirm improvements in
motion fidelity. The proposed framework generalizes across exoskeleton designs
with closed-loop kinematics and minimal sensing, and lays the foundation for
high-fidelity teleoperation and learning-from-demonstration applications.
comment: 8 pages, 10 figures, submitted to RA-L
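As a concrete picture of residual-weighted calibration, the sketch below fits virtual link parameters by weighted nonlinear least squares against motion-capture fingertip targets. The planar two-link `fk` model, the weights, and all numbers are hypothetical stand-ins; the Maestro's actual kinematics and data-driven weight tuning are not reproduced here.

```python
import numpy as np
from scipy.optimize import least_squares

# Hedged sketch: fit virtual link lengths `theta` so that a forward-kinematics
# model matches redundant sensor readings, with per-sample residual weights.
def fk(theta: np.ndarray, q: np.ndarray) -> np.ndarray:
    # Hypothetical planar 2-link model standing in for the real exoskeleton.
    l1, l2 = theta
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def weighted_residuals(theta, qs, targets, weights):
    res = np.concatenate([fk(theta, q) - t for q, t in zip(qs, targets)])
    return np.repeat(np.sqrt(weights), 2) * res  # weight each (x, y) sample

qs = [np.array([0.3, 0.5]), np.array([0.8, 0.2]), np.array([1.0, 0.9])]
true = np.array([0.05, 0.04])                # "ground-truth" link lengths
targets = [fk(true, q) for q in qs]          # mocap fingertip positions
weights = np.array([1.0, 1.0, 0.2])          # trust the noisy sample less
sol = least_squares(weighted_residuals, x0=np.array([0.06, 0.03]),
                    args=(qs, targets, weights))
print(sol.x)  # converges to ~[0.05, 0.04]
```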
☆ Can LLM-Reasoning Models Replace Classical Planning? A Benchmark Study
Recent advancements in Large Language Models have sparked interest in their
potential for robotic task planning. While these models demonstrate strong
generative capabilities, their effectiveness in producing structured and
executable plans remains uncertain. This paper presents a systematic evaluation
of a broad spectrum of current state-of-the-art language models, each directly
prompted with Planning Domain Definition Language (PDDL) domain and problem
files,
and compares their planning performance with the Fast Downward planner across a
variety of benchmarks. In addition to measuring success rates, we assess how
faithfully the generated plans translate into sequences of actions that can
actually be executed, identifying both strengths and limitations of using these
models in this setting. Our findings show that while the models perform well on
simpler planning tasks, they continue to struggle with more complex scenarios
that require precise resource management, consistent state tracking, and strict
constraint compliance. These results underscore fundamental challenges in
applying language models to robotic planning in real-world environments. By
outlining the gaps that emerge during execution, we aim to guide future
research toward combined approaches that integrate language models with
classical planners in order to enhance the reliability and scalability of
planning in autonomous robotics.
☆ Impact of a Lower Limb Exosuit Anchor Points on Energetics and Biomechanics
Chiara Lambranzi, Giulia Oberti, Christian Di Natali, Darwin G. Caldwell, Manuela Galli, Elena De Momi, Jesùs Ortiz
Anchor point placement is a crucial yet often overlooked aspect of exosuit
design since it determines how forces interact with the human body. This work
analyzes the impact of different anchor point positions on gait kinematics,
muscular activation and energetic consumption. A total of six experiments were
conducted with 11 subjects wearing the XoSoft exosuit, which assists hip
flexion in five configurations. Subjects were instrumented with an IMU-based
motion tracking system, EMG sensors, and a mask to measure metabolic
consumption. The results show that positioning the knee anchor point on the
posterior side while keeping the hip anchor on the anterior part can reduce
muscle activation in the hip flexors by up to 10.21% and metabolic expenditure
by up to 18.45%. Although the hip was the only assisted joint, all
configurations also introduced changes in knee and ankle kinematics.
Overall, no single configuration was optimal across all subjects, suggesting
that a personalized approach is necessary to transmit the assistance forces
optimally. These findings emphasize that anchor point position does indeed have
a significant impact on exoskeleton effectiveness and efficiency. However,
these optimal positions are subject-specific for a given exosuit design, and there
is a strong need for future work to tailor musculoskeletal models to individual
characteristics and validate these results in clinical populations.
comment: 12 pages, 10 figures
☆ User Experience Estimation in Human-Robot Interaction Via Multi-Instance Learning of Multimodal Social Signals IROS 2025
In recent years, the demand for social robots has grown, requiring them to
adapt their behaviors based on users' states. Accurately assessing user
experience (UX) in human-robot interaction (HRI) is crucial for achieving this
adaptability. UX is a multi-faceted measure encompassing aspects such as
sentiment and engagement, yet existing methods often focus on these
individually. This study proposes a UX estimation method for HRI by leveraging
multimodal social signals. We construct a UX dataset and develop a
Transformer-based model that utilizes facial expressions and voice for
estimation. Unlike conventional models that rely on momentary observations, our
approach captures both short- and long-term interaction patterns using a
multi-instance learning framework. This enables the model to capture temporal
dynamics in UX, providing a more holistic representation. Experimental results
demonstrate that our method outperforms third-party human evaluators in UX
estimation.
comment: This paper has been accepted for presentation at IEEE/RSJ
International Conference on Intelligent Robots and Systems 2025 (IROS 2025)
☆ A Unified Perception-Language-Action Framework for Adaptive Autonomous Driving
Autonomous driving systems face significant challenges in achieving
human-like adaptability, robustness, and interpretability in complex,
open-world environments. These challenges stem from fragmented architectures,
limited generalization to novel scenarios, and insufficient semantic extraction
from perception. To address these limitations, we propose a unified
Perception-Language-Action (PLA) framework that integrates multi-sensor fusion
(cameras, LiDAR, radar) with a large language model (LLM)-augmented
Vision-Language-Action (VLA) architecture, specifically a GPT-4.1-powered
reasoning core. This framework unifies low-level sensory processing with
high-level contextual reasoning, tightly coupling perception with natural
language-based semantic understanding and decision-making to enable
context-aware, explainable, and safety-bounded autonomous driving. Evaluations
on an urban intersection scenario with a construction zone demonstrate superior
performance in trajectory tracking, speed prediction, and adaptive planning.
The results highlight the potential of language-augmented cognitive frameworks
for advancing the safety, interpretability, and scalability of autonomous
driving systems.
☆ H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation
Imitation learning for robotic manipulation faces a fundamental challenge:
the scarcity of large-scale, high-quality robot demonstration data. Recent
robotic foundation models often pre-train on cross-embodiment robot datasets to
increase data scale, but they face significant limitations: the diverse
morphologies and action spaces across robot embodiments make unified training
challenging. In this paper, we present H-RDT (Human to Robotics
Diffusion Transformer), a novel approach that leverages human manipulation data
to enhance robot manipulation capabilities. Our key insight is that large-scale
egocentric human manipulation videos with paired 3D hand pose annotations
provide rich behavioral priors that capture natural manipulation strategies and
can benefit robotic policy learning. We introduce a two-stage training
paradigm: (1) pre-training on large-scale egocentric human manipulation data,
and (2) cross-embodiment fine-tuning on robot-specific data with modular action
encoders and decoders. Built on a diffusion transformer architecture with 2B
parameters, H-RDT uses flow matching to model complex action distributions.
Extensive evaluations encompassing both simulation and real-world experiments,
single-task and multitask scenarios, as well as few-shot learning and
robustness assessments, demonstrate that H-RDT outperforms training from
scratch and existing state-of-the-art methods, including Pi0 and RDT, achieving
significant improvements of 13.9% and 40.5% over training from scratch in
simulation and real-world experiments, respectively. The results validate our
core hypothesis that human manipulation data can serve as a powerful foundation
for learning bimanual robotic manipulation policies.
☆ Online Estimation of Table-Top Grown Strawberry Mass in Field Conditions with Occlusions IROS 2025
Accurate mass estimation of table-top grown strawberries under field
conditions remains challenging due to frequent occlusions and pose variations.
This study proposes a vision-based pipeline integrating RGB-D sensing and deep
learning to enable non-destructive, real-time and online mass estimation. The
method employed YOLOv8-Seg for instance segmentation, Cycle-consistent
generative adversarial network (CycleGAN) for occluded region completion, and
tilt-angle correction to refine frontal projection area calculations. A
polynomial regression model then mapped the geometric features to mass.
Experiments demonstrated mean mass estimation errors of 8.11% for isolated
strawberries and 10.47% for occluded cases. CycleGAN outperformed the Large
Mask Inpainting (LaMa) model in occlusion recovery, achieving superior pixel area
ratios (PAR) (mean: 0.978 vs. 1.112) and higher intersection over union (IoU)
scores (92.3% vs. 47.7% in the [0.9-1] range). This approach addresses critical
limitations of traditional methods, offering a robust solution for automated
harvesting and yield monitoring with complex occlusion patterns.
comment: Accepted by IROS 2025
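The final regression stage of such a pipeline is compact enough to sketch directly: correct the projected area for the estimated tilt, then map area to mass with a fitted polynomial. The coefficients, polynomial degree, and sample values below are illustrative assumptions, not the paper's fitted model.

```python
import numpy as np

# Hedged sketch of the area-to-mass regression stage with a simple tilt
# correction. All numbers are synthetic examples.
area = np.array([4.1, 5.6, 7.2, 8.9, 10.3])   # frontal projection areas (cm^2)
mass = np.array([6.0, 9.1, 12.8, 17.0, 20.9])  # ground-truth masses (g)

coeffs = np.polyfit(area, mass, deg=2)         # fit m = a*A^2 + b*A + c
predict = np.poly1d(coeffs)

tilt_deg = 20.0                                # camera-estimated tilt angle
raw_area = 8.0                                 # measured projected area
corrected = raw_area / np.cos(np.radians(tilt_deg))  # undo foreshortening
print(predict(corrected))                      # estimated mass in grams
```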
☆ Quantifying and Visualizing Sim-to-Real Gaps: Physics-Guided Regularization for Reproducibility
Simulation-to-real transfer using domain randomization for robot control
often relies on low-gear-ratio, backdrivable actuators, but these approaches
break down when the sim-to-real gap widens. Inspired by the traditional PID
controller, we reinterpret its gains as surrogates for complex, unmodeled plant
dynamics. We then introduce a physics-guided gain regularization scheme that
measures a robot's effective proportional gains via simple real-world
experiments and penalizes any deviation of a neural controller's local
input-output sensitivities from these values during training. To avoid the
overly conservative bias of naive domain randomization, we also condition the
controller on the current plant parameters. On an off-the-shelf two-wheeled
balancing robot with a 110:1 gearbox, our gain-regularized,
parameter-conditioned RNN achieves angular settling times in hardware that
closely match simulation, whereas a purely domain-randomized policy
exhibits persistent oscillations and a substantial sim-to-real gap. These
results demonstrate a lightweight, reproducible framework for closing
sim-to-real gaps on affordable robotic hardware.
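One plausible reading of the gain-regularization term is a penalty on the controller's input-output Jacobian relative to the gains measured on hardware. The PyTorch sketch below computes such a penalty with autograd; the network shape, the gain vector, and the weighting are assumptions for illustration, not the paper's exact formulation.

```python
import torch

# Hedged sketch of physics-guided gain regularization: penalize deviation of
# the controller's local input-output sensitivity from measured gains.
net = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
k_measured = torch.tensor([[-12.0, -1.5, -0.8, -0.1]])  # assumed measured gains

def gain_penalty(state: torch.Tensor) -> torch.Tensor:
    state = state.clone().requires_grad_(True)
    u = net(state).sum()                               # scalar for autograd
    (jac,) = torch.autograd.grad(u, state, create_graph=True)
    return ((jac - k_measured) ** 2).mean()            # sensitivity mismatch

state_batch = torch.randn(64, 4)   # sampled operating points
loss = gain_penalty(state_batch)   # added to the task/RL loss during training
loss.backward()
```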
☆ Policy Learning from Large Vision-Language Model Feedback without Reward Modeling IROS 2025
Offline reinforcement learning (RL) provides a powerful framework for
training robotic agents using pre-collected, suboptimal datasets, eliminating
the need for costly, time-consuming, and potentially hazardous online
interactions. This is particularly useful in safety-critical real-world
applications, where online data collection is expensive and impractical.
However, existing offline RL algorithms typically require reward labeled data,
which introduces an additional bottleneck: reward function design is itself
costly, labor-intensive, and requires significant domain expertise. In this
paper, we introduce PLARE, a novel approach that leverages large
vision-language models (VLMs) to provide guidance signals for agent training.
Instead of relying on manually designed reward functions, PLARE queries a VLM
for preference labels on pairs of visual trajectory segments based on a
language task description. The policy is then trained directly from these
preference labels using a supervised contrastive preference learning objective,
bypassing the need to learn explicit reward models. Through extensive
experiments on robotic manipulation tasks from the MetaWorld benchmark, PLARE achieves
performance on par with or surpassing existing state-of-the-art VLM-based
reward generation methods. Furthermore, we demonstrate the effectiveness of
PLARE in real-world manipulation tasks with a physical robot, further
validating its practical applicability.
comment: Accepted to IROS 2025
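Preference labels like these are typically consumed by a Bradley-Terry style logistic objective over per-segment scores. The snippet below is a minimal sketch of that family of losses, assuming each segment is scored by, e.g., the summed action log-probabilities under the current policy; PLARE's exact contrastive objective may differ in parameterization.

```python
import torch

# Hedged sketch of a Bradley-Terry preference loss: push the score of the
# VLM-preferred segment above the rejected one.
def preference_loss(score_preferred: torch.Tensor,
                    score_rejected: torch.Tensor) -> torch.Tensor:
    # -log P(preferred > rejected) under a logistic preference model.
    return -torch.nn.functional.logsigmoid(score_preferred - score_rejected).mean()

# Example: per-segment scores (one common choice: summed log-probabilities of
# the segment's actions under the current policy).
s_pref = torch.tensor([3.2, 1.1, 0.4], requires_grad=True)
s_rej = torch.tensor([2.9, 1.5, -0.2])
loss = preference_loss(s_pref, s_rej)
loss.backward()
print(loss.item())
```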
☆ Multi-Waypoint Path Planning and Motion Control for Non-holonomic Mobile Robots in Agricultural Applications
There is a growing demand for autonomous mobile robots capable of navigating
unstructured agricultural environments. Tasks such as weed control in meadows
require efficient path planning through an unordered set of coordinates while
minimizing travel distance and adhering to curvature constraints to prevent
soil damage and protect vegetation. This paper presents an integrated
navigation framework combining a global path planner based on the Dubins
Traveling Salesman Problem (DTSP) with a Nonlinear Model Predictive Control
(NMPC) strategy for local path planning and control. The DTSP generates a
minimum-length, curvature-constrained path that efficiently visits all targets,
while the NMPC leverages this path to compute control signals to accurately
reach each waypoint. The system's performance was validated through comparative
simulation analysis on real-world field datasets, demonstrating that the
coupled DTSP-based planner produced smoother and shorter paths, with a
reduction of about 16% in the provided scenario, compared to decoupled methods.
Building on this path, the NMPC controller effectively steered the robot to the desired
waypoints, while locally optimizing the trajectory and ensuring adherence to
constraints. These findings demonstrate the potential of the proposed framework
for efficient autonomous navigation in agricultural environments.
comment: 6 pages
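To make the global stage concrete, the toy sketch below orders the unordered waypoints with a nearest-neighbour tour. A real DTSP solver measures distances with the curvature-constrained Dubins path length between poses; the Euclidean stub here is only a stand-in lower bound and is labeled as such.

```python
import numpy as np

# Toy sketch of the global waypoint-ordering stage. `dubins_length_stub`
# should be replaced by the true Dubins path length for a minimum turning
# radius r; Euclidean distance is used here purely as a placeholder.
def dubins_length_stub(p: np.ndarray, q: np.ndarray, r: float = 1.0) -> float:
    return float(np.linalg.norm(p - q))  # placeholder: Euclidean lower bound

def greedy_tour(points: np.ndarray) -> list[int]:
    unvisited = set(range(1, len(points)))
    tour, cur = [0], 0
    while unvisited:
        nxt = min(unvisited,
                  key=lambda j: dubins_length_stub(points[cur], points[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
        cur = nxt
    return tour

pts = np.array([[0, 0], [5, 1], [1, 4], [6, 5], [2, 2]], dtype=float)
print(greedy_tour(pts))  # visiting order of the weed-control targets
```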
☆ Learning to Drift with Individual Wheel Drive: Maneuvering Autonomous Vehicle at the Handling Limits
Drifting, characterized by controlled vehicle motion at high sideslip angles,
is crucial for safely handling emergency scenarios at the friction limits.
While recent reinforcement learning approaches show promise for drifting
control, they struggle with the significant simulation-to-reality gap, as
policies that perform well in simulation often fail when transferred to
physical systems. In this paper, we present a reinforcement learning framework
with GPU-accelerated parallel simulation and systematic domain randomization
that effectively bridges the gap. The proposed approach is validated on both
simulation and a custom-designed and open-sourced 1/10 scale Individual Wheel
Drive (IWD) RC car platform featuring independent wheel speed control.
Experiments across various scenarios from steady-state circular drifting to
direction transitions and variable-curvature path following demonstrate that
our approach achieves precise trajectory tracking while maintaining controlled
sideslip angles throughout complex maneuvers in both simulated and real-world
environments.
☆ Assessing the Alignment of Automated Vehicle Decisions with Human Reasons
A key challenge in deploying automated vehicles (AVs) is ensuring they make
appropriate decisions in ethically challenging everyday driving situations.
While much attention has been paid to rare, high-stakes dilemmas such as
trolley problems, similar tensions also arise in routine scenarios, such as
navigating empty intersections, where multiple human considerations, including
legality and comfort, often conflict. Current AV planning systems typically
rely on rigid rules, which struggle to balance these competing considerations
and can lead to behaviour that misaligns with human expectations. This paper
proposes a novel reasons-based trajectory evaluation framework that
operationalises the tracking condition of Meaningful Human Control (MHC). The
framework models the reasons of human agents, such as regulatory compliance, as
quantifiable functions and evaluates how well candidate AV trajectories align
with these reasons. By assigning adjustable weights to agent priorities and
integrating a balance function to discourage the exclusion of any agent, the
framework supports interpretable decision evaluation. Through a
real-world-inspired overtaking scenario, we show how this approach reveals
tensions, for instance between regulatory compliance, efficiency, and comfort.
The framework functions as a modular evaluation layer over existing planning
algorithms. It offers a transparent tool for assessing ethical alignment in
everyday scenarios and provides a practical step toward implementing MHC in
real-world AV deployment.
comment: This version incorporates revisions based on peer-review feedback
from a prior submission. The work has not yet been accepted and is being
prepared for resubmission
☆ Whisker-based Active Tactile Perception for Contour Reconstruction
Yixuan Dang, Qinyang Xu, Yu Zhang, Xiangtong Yao, Liding Zhang, Zhenshan Bing, Florian Roehrbein, Alois Knoll
Perception using whisker-inspired tactile sensors currently faces a major
challenge: the lack of active control in robots based on direct contact
information from the whisker. To accurately reconstruct object contours, it is
crucial for the whisker sensor to continuously follow and maintain an
appropriate relative touch pose on the surface. This is especially important
for localization based on tip contact, which has a low tolerance for sharp
surfaces and must avoid slipping into tangential contact. In this paper, we
first construct a magnetically transduced whisker sensor featuring a compact
and robust suspension system composed of three flexible spiral arms. We develop
a method that leverages a characterized whisker deflection profile to directly
extract the tip contact position using gradient descent, with a Bayesian filter
applied to reduce fluctuations. We then propose an active motion control policy
to maintain the optimal relative pose of the whisker sensor against the object
surface. A B-Spline curve is employed to predict the local surface curvature
and determine the sensor orientation. Results demonstrate that our algorithm
can effectively track objects and reconstruct contours with sub-millimeter
accuracy. Finally, we validate the method in simulations and real-world
experiments where a robot arm drives the whisker sensor to follow the surfaces
of three different objects.
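The surface-following step can be pictured with a few lines of SciPy: fit a B-spline to the recent tip-contact points, then read off the local tangent and curvature to set the next sensor orientation. The contact points and smoothing factor below are synthetic placeholders, not the paper's tuned values.

```python
import numpy as np
from scipy.interpolate import splprep, splev

# Hedged sketch: B-spline fit over recent tip contacts to predict local
# surface direction and curvature for the next sensor pose.
contacts = np.array([[0.0, 0.0], [1.0, 0.4], [2.0, 0.6], [3.0, 0.5],
                     [4.0, 0.1]]).T                 # recent tip contacts (2xN)
tck, u = splprep(contacts, s=0.01, k=3)             # cubic B-spline fit
dx, dy = splev(u[-1], tck, der=1)                   # tangent at newest contact
ddx, ddy = splev(u[-1], tck, der=2)                 # second derivative
heading = np.arctan2(dy, dx)                        # desired sensor yaw
kappa = (dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5  # local curvature
print(np.degrees(heading), kappa)
```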
☆ GSFusion: Globally Optimized LiDAR-Inertial-Visual Mapping for Gaussian Splatting
While 3D Gaussian Splatting (3DGS) has revolutionized photorealistic mapping,
conventional approaches based on camera sensors, even RGB-D ones, suffer from
fundamental limitations such as high computational load, failure in
environments with poor texture or illumination, and short operational ranges.
LiDAR emerges as a robust alternative, but its integration with 3DGS introduces
new challenges, such as the need for exceptional global alignment for
photorealistic quality and prolonged optimization times caused by sparse data.
To address these challenges, we propose GSFusion, an online
LiDAR-Inertial-Visual mapping system that ensures high-precision map
consistency through a surfel-to-surfel constraint in the global pose-graph
optimization. To handle sparse data, our system employs a pixel-aware Gaussian
initialization strategy for efficient representation and a bounded sigmoid
constraint to prevent uncontrolled Gaussian growth. Experiments on public
datasets and our own demonstrate that our system outperforms existing 3DGS
SLAM systems in terms of rendering quality and map-building efficiency.
☆ Simulation-based planning of Motion Sequences for Automated Procedure Optimization in Multi-Robot Assembly Cells
Reconfigurable multi-robot cells offer a promising approach to meet
fluctuating assembly demands. However, the recurrent planning of their
configurations introduces new challenges, particularly in generating optimized,
coordinated multi-robot motion sequences that minimize the assembly duration.
This work presents a simulation-based method for generating such optimized
sequences. The approach separates assembly steps into task-related core
operations and connecting traverse operations. While core operations are
constrained and predetermined, traverse operations offer substantial
optimization potential. Scheduling the core operations is formulated as an
optimization problem, requiring feasible traverse operations to be integrated
using a decomposition-based motion planning strategy. Several solution
techniques are explored, including a sampling heuristic, tree-based search and
gradient-free optimization. For motion planning, a decomposition method is
proposed that identifies specific areas in the schedule, which can be solved
independently with modified centralized path planning algorithms. The proposed
method generates efficient and collision-free multi-robot assembly procedures
that outperform a baseline relying on decentralized, robot-individual motion
planning. Its effectiveness is demonstrated through simulation experiments.
☆ Quadratic Programming-Based Posture Manipulation and Thrust-vectoring for Agile Dynamic Walking on Narrow Pathways
Chenghao Wang, Eric Sihite, Kaushik Venkatesh Krishnamurthy, Shreyansh Pitroda, Adarsh Salagame, Alireza Ramezani, Morteza Gharib
There have been significant advances in legged robots' agility, with robots now
demonstrating impressive acrobatic maneuvers such as parkour. These maneuvers
rely heavily on posture manipulation. To expand stability and locomotion
plasticity, we exploit the multi-modal capability of our legged-aerial
platform, the Husky $\beta$, to perform thruster-assisted walking. This robot
has thrusters on
each of its sagittal knee joints which can be used to stabilize its frontal
dynamic as it walks. In this work, we perform a simulation study of quadruped
narrow-path walking with Husky $\beta$, where the robot will utilize its
thrusters to stably walk on a narrow path. The controller is designed based on
a centroidal dynamics model with thruster and foot ground contact forces as
inputs. These inputs are regulated using a QP solver to be used in a model
predictive control framework. In addition to narrow-path walking, we also
perform a lateral push-recovery simulation to study how the thrusters can be
used to stabilize the frontal dynamics.
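The allocation step described here, regulating thruster and contact forces with a QP inside an MPC loop, reduces at its core to a bounded least-squares problem. The sketch below solves a tiny instance with SciPy; the allocation matrix, desired wrench, and bounds are invented placeholders rather than the Husky $\beta$ model.

```python
import numpy as np
from scipy.optimize import lsq_linear

# Hedged sketch of force allocation as a bounded least-squares QP:
# find actuator forces x with A @ x close to a desired body wrench w_des.
A = np.array([[1.0, 0.0, 1.0, 0.0],     # net lateral force
              [0.0, 1.0, 0.0, 1.0],     # net vertical force
              [0.3, -0.2, -0.3, 0.2]])  # net roll moment (assumed lever arms)
w_des = np.array([0.0, 60.0, 2.0])      # desired centroidal wrench

res = lsq_linear(A, w_des,
                 bounds=(np.array([0, 0, 0, 0]),       # unilateral forces
                         np.array([30, 80, 30, 80])))  # actuator limits
print(res.x, A @ res.x)                 # forces and the realized wrench
```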
☆ Benchmarking Massively Parallelized Multi-Task Reinforcement Learning for Robotics Tasks
Multi-task Reinforcement Learning (MTRL) has emerged as a critical training
paradigm for applying reinforcement learning (RL) to a set of complex
real-world robotic tasks, which demands a generalizable and robust policy. At
the same time, \emph{massively parallelized training} has gained popularity,
not only for significantly accelerating data collection through GPU-accelerated
simulation but also for enabling diverse data collection across multiple tasks
by simulating heterogeneous scenes in parallel. However, existing MTRL research
has largely been limited to off-policy methods like SAC in the
low-parallelization regime. MTRL could capitalize on the higher asymptotic
performance of on-policy algorithms, whose batches require data from the
current policy, and as a result, take advantage of massive parallelization
offered by GPU-accelerated simulation. To bridge this gap, we introduce a
massively parallelized Multi-Task Benchmark for robotics (MTBench), an
open-sourced benchmark featuring a broad
distribution of 50 manipulation tasks and 20 locomotion tasks, implemented
using the GPU-accelerated simulator IsaacGym. MTBench also includes four base
RL algorithms combined with seven state-of-the-art MTRL algorithms and
architectures, providing a unified framework for evaluating their performance.
Our extensive experiments highlight the superior speed of evaluating MTRL
approaches using MTBench, while also uncovering unique challenges that arise
from combining massive parallelism with MTRL. Code is available at
https://github.com/Viraj-Joshi/MTBench.
comment: RLC 2025
♻ ☆ UniLegs: Universal Multi-Legged Robot Control through Morphology-Agnostic Policy Distillation IROS 2025
Developing controllers that generalize across diverse robot morphologies
remains a significant challenge in legged locomotion. Traditional approaches
either create specialized controllers for each morphology or compromise
performance for generality. This paper introduces a two-stage teacher-student
framework that bridges this gap through policy distillation. First, we train
specialized teacher policies optimized for individual morphologies, capturing
the unique optimal control strategies for each robot design. Then, we distill
this specialized expertise into a single Transformer-based student policy
capable of controlling robots with varying leg configurations. Our experiments
across five distinct legged morphologies demonstrate that our approach
preserves morphology-specific optimal behaviors, with the Transformer
architecture achieving 94.47% of teacher performance on training morphologies
and 72.64% on unseen robot designs. Comparative analysis reveals that
Transformer-based architectures consistently outperform MLP baselines by
leveraging attention mechanisms to effectively model joint relationships across
different kinematic structures. We validate our approach through successful
deployment on a physical quadruped robot, demonstrating the practical viability
of our morphology-agnostic control framework. This work presents a scalable
solution for developing universal legged robot controllers that maintain
near-optimal performance while generalizing across diverse morphologies.
comment: 6 pages, 3 figures, IROS 2025
♻ ☆ Diffusion Beats Autoregressive in Data-Constrained Settings
Autoregressive (AR) models have long dominated the landscape of large
language models, driving progress across a wide range of tasks. Recently,
diffusion-based language models have emerged as a promising alternative, though
their advantages over AR models remain underexplored. In this paper, we
systematically study masked diffusion models in data-constrained settings,
where training involves repeated passes over limited data, and find that they
significantly outperform AR models when compute is abundant but data is scarce.
Diffusion models make better use of repeated data, achieving lower validation
loss and superior downstream performance. We interpret this advantage as
implicit data augmentation: masked diffusion exposes the model to a diverse
distribution of token orderings and prediction tasks, unlike AR's fixed
left-to-right factorization. We find new scaling laws for diffusion models and
derive a closed-form expression for the critical compute threshold at which
diffusion begins to outperform AR. These results suggest that when data, not
compute, is the bottleneck, diffusion models offer a compelling alternative to
the standard AR paradigm. Our code is available at:
https://diffusion-scaling.github.io.
comment: Project Webpage: https://diffusion-scaling.github.io
♻ ☆ SHINE: Social Homology Identification for Navigation in Crowded Environments
Diego Martinez-Baselga, Oscar de Groot, Luzia Knoedler, Luis Riazuelo, Javier Alonso-Mora, Luis Montano
Navigating mobile robots in social environments remains a challenging task
due to the intricacies of human-robot interactions. Most of the motion planners
designed for crowded and dynamic environments focus on choosing the best
velocity to reach the goal while avoiding collisions, but do not explicitly
consider the high-level navigation behavior (avoiding through the left or right
side, letting others pass or passing before others, etc.). In this work, we
present a novel motion planner that incorporates topologically distinct paths
representing diverse navigation strategies around humans. The planner selects
the topology class that best imitates human behavior using a deep neural
network model trained on real-world human motion data, ensuring socially
intelligent and contextually aware navigation. Our system refines the chosen
path through an optimization-based local planner in real time, ensuring
seamless adherence to desired social behaviors. In this way, we decouple
perception and local planning from the decision-making process. We evaluate the
prediction accuracy of the network with real-world data. In addition, we assess
the navigation capabilities in both simulation and a real-world platform,
comparing it with other state-of-the-art planners. We demonstrate that our
planner exhibits socially desirable behaviors and shows a smooth and remarkable
performance.
comment: This paper has been accepted for publication at The International
Journal of Robotics Research. Please, when citing the paper, refer to the
official manuscript with the following DOI: 10.1177/02783649251344639
♻ ☆ Decentralized Uncertainty-Aware Multi-Agent Collision Avoidance with Model Predictive Path Integral IROS2025
Decentralized multi-agent navigation under uncertainty is a complex task that
arises in numerous robotic applications. It requires collision avoidance
strategies that account for kinematic constraints as well as sensing and
action-execution noise. In this paper, we propose a novel approach that
integrates the
Model Predictive Path Integral (MPPI) with a probabilistic adaptation of
Optimal Reciprocal Collision Avoidance. Our method ensures safe and efficient
multi-agent navigation by incorporating probabilistic safety constraints
directly into the MPPI sampling process via a Second-Order Cone Programming
formulation. This approach enables agents to operate independently using local
noisy observations while maintaining safety guarantees. We validate our
algorithm through extensive simulations with differential-drive robots and
benchmark it against state-of-the-art methods, including ORCA-DD and B-UAVC.
Results demonstrate that our approach outperforms them while achieving high
success rates, even in densely populated environments. Additionally, validation
in the Gazebo simulator confirms its practical applicability to robotic
platforms. Source code is available at
http://github.com/PathPlanning/MPPI-Collision-Avoidance.
comment: This is a pre-print of the paper accepted to IROS2025. The manuscript
includes 8 pages, 4 figures, and 1 table. A supplementary video is available
at https://youtu.be/_D4zDYJ4KCk Updated version: added link to source code in
the abstract; updated experimental results description in Section VI.A;
updated author affiliation and funding information; minor typo corrections
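For readers new to MPPI, the core sampling loop is short: perturb a nominal control sequence, roll each sample through the dynamics, and re-average with softmax weights on cost. The single-agent Python sketch below uses a unicycle model and omits the paper's key addition, the SOCP projection onto probabilistic ORCA-style safety constraints.

```python
import numpy as np

# Minimal single-agent MPPI sketch with unicycle dynamics. The multi-agent
# safety constraints from the paper are intentionally omitted.
rng = np.random.default_rng(0)
H, K, dt, lam = 20, 256, 0.1, 1.0          # horizon, samples, step, temperature
goal = np.array([2.0, 2.0])

def rollout(u_seq, x0=np.zeros(3)):
    x = x0.copy()
    for v, w in u_seq:                      # (linear, angular) velocity inputs
        x += dt * np.array([v * np.cos(x[2]), v * np.sin(x[2]), w])
    return np.linalg.norm(x[:2] - goal)     # terminal distance as cost

u_nom = np.zeros((H, 2))
for _ in range(30):                         # MPPI iterations
    noise = rng.normal(0.0, 0.4, size=(K, H, 2))
    costs = np.array([rollout(u_nom + n) for n in noise])
    wgt = np.exp(-(costs - costs.min()) / lam)
    u_nom += (wgt[:, None, None] * noise).sum(0) / wgt.sum()
print(rollout(u_nom))                       # near zero once converged
```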
♻ ☆ SmartPNT-MSF: A Multi-Sensor Fusion Dataset for Positioning and Navigation Research
High-precision navigation and positioning systems are critical for
applications in autonomous vehicles and mobile mapping, where robust and
continuous localization is essential. To test and enhance the performance of
algorithms, some research institutions and companies have successively
constructed and publicly released datasets. However, existing datasets still
suffer from limitations in sensor diversity and environmental coverage. To
address these shortcomings and advance development in related fields, the
SmartPNT Multisource Integrated Navigation, Positioning, and Attitude Dataset
has been developed. This dataset integrates data from multiple sensors,
including Global Navigation Satellite Systems (GNSS), Inertial Measurement
Units (IMU), optical cameras, and LiDAR, to provide a rich and versatile
resource for research in multi-sensor fusion and high-precision navigation. The
dataset construction process is thoroughly documented, encompassing sensor
configurations, coordinate system definitions, and calibration procedures for
both cameras and LiDAR. A standardized framework for data collection and
processing ensures consistency and scalability, enabling large-scale analysis.
Validation using state-of-the-art Simultaneous Localization and Mapping (SLAM)
algorithms, such as VINS-Mono and LIO-SAM, demonstrates the dataset's
applicability for advanced navigation research. Covering a wide range of
real-world scenarios, including urban areas, campuses, tunnels, and suburban
environments, the dataset offers a valuable tool for advancing navigation
technologies and addressing challenges in complex environments. By providing a
publicly accessible, high-quality dataset, this work aims to bridge gaps in
sensor diversity, data accessibility, and environmental representation,
fostering further innovation in the field.
♻ ☆ Generalizable Motion Policies through Keypoint Parameterization and Transportation Maps
Learning from Interactive Demonstrations has revolutionized the way
non-expert humans teach robots. It is enough to kinesthetically move the robot
around to teach pick-and-place, dressing, or cleaning policies. However, the
main challenge is correctly generalizing to novel situations, e.g., different
surfaces to clean or different arm postures to dress. This article proposes a
novel task parameterization and generalization to transport the original robot
policy, i.e., position, velocity, orientation, and stiffness. Unlike the state
of the art, only a set of keypoints is tracked during the demonstration and the
execution, e.g., a point cloud of the surface to clean. We then propose to fit
a nonlinear transformation that deforms the space, and with it the original
policy, using the paired source and target point sets. The use of function
approximators like Gaussian Processes allows us to generalize, or transport,
the policy from every space location while estimating the uncertainty of the
resulting policy due to the limited task keypoints and the reduced number of
demonstrations. We compare the algorithm's performance with state-of-the-art
task parameterization alternatives and analyze the effect of different function
approximators. We also validated the algorithm on robot manipulation tasks,
i.e., dressing arms in different postures, reshelving products at different
locations, and cleaning surfaces of different shapes.
comment: This article was accepted at IEEE Transactions on Robotics (T-RO)
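A Gaussian-process transportation map of this kind can be sketched in a few lines: fit a GP from the source keypoints to their displacements toward the target keypoints, then push the demonstrated positions through the same map, with the predictive standard deviation flagging low-confidence regions. The keypoints and kernel below are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hedged sketch of policy transportation via a GP displacement field.
src = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # demo keypoints
dst = np.array([[0.1, 0.2], [1.2, 0.1], [1.3, 1.2], [0.0, 1.1]])  # new scene

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
gp.fit(src, dst - src)                      # learn the displacement field

demo_path = np.array([[0.2, 0.2], [0.5, 0.5], [0.8, 0.8]])
shift, std = gp.predict(demo_path, return_std=True)
transported = demo_path + shift             # warped policy positions
print(transported, std)                     # std flags low-confidence regions
```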
♻ ☆ iFANnpp: Nuclear Power Plant Digital Twin for Robots and Autonomous Intelligence
Robotics has gained attention in the nuclear industry due to its precision
and ability to automate tasks. However, there is a critical need for advanced
simulation and control methods to predict robot behavior and optimize plant
performance, motivating the use of digital twins. Most existing digital twins
do not model a nuclear power plant in its entirety. Moreover, they are
designed for specific algorithms or tasks, making them unsuitable for broader
research applications. In response, this work proposes a comprehensive nuclear
power plant digital twin designed to improve real-time monitoring, operational
efficiency, and predictive maintenance. A full nuclear power plant is modeled
in Unreal Engine 5 and integrated with a high-fidelity Generic Pressurized
Water Reactor Simulator to create a realistic model of a nuclear power plant
and a real-time updated virtual environment. The virtual environment provides
various features for researchers to easily test custom robot algorithms and
frameworks.
♻ ☆ Generalizable Image Repair for Robust Visual Control IROS 2025
Vision-based control relies on accurate perception to achieve robustness.
However, image distribution changes caused by sensor noise, adverse weather,
and dynamic lighting can degrade perception, leading to suboptimal control
decisions. Existing approaches, including domain adaptation and adversarial
training, improve robustness but struggle to generalize to unseen corruptions
while introducing computational overhead. To address this challenge, we propose
a real-time image repair module that restores corrupted images before they are
used by the controller. Our method leverages generative adversarial models,
specifically CycleGAN and pix2pix, for image repair. CycleGAN enables unpaired
image-to-image translation to adapt to novel corruptions, while pix2pix
exploits paired image data when available to improve the quality. To ensure
alignment with control performance, we introduce a control-focused loss
function that prioritizes perceptual consistency in repaired images. We
evaluated our method in a simulated autonomous racing environment with various
visual corruptions. The results show that our approach significantly improves
performance compared to baselines, mitigating distribution shift and enhancing
controller reliability.
comment: 8 pages, 4 figures, 2 tables, 2025 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS 2025)
♻ ☆ KGN-Pro: Keypoint-Based Grasp Prediction through Probabilistic 2D-3D Correspondence Learning
High-level robotic manipulation tasks demand flexible 6-DoF grasp estimation
to serve as a basic function. Previous approaches either directly generate
grasps from point-cloud data, suffering from challenges with small objects and
sensor noise, or infer 3D information from RGB images, which introduces
expensive annotation requirements and discretization issues. Recent methods
mitigate some challenges by retaining a 2D representation to estimate grasp
keypoints and applying Perspective-n-Point (PnP) algorithms to compute 6-DoF
poses. However, these methods are limited by their non-differentiable nature
and reliance solely on 2D supervision, which hinders the full exploitation of
rich 3D information. In this work, we present KGN-Pro, a novel grasping network
that preserves the efficiency and fine-grained object grasping of previous KGNs
while integrating direct 3D optimization through probabilistic PnP layers.
KGN-Pro encodes paired RGB-D images to generate a keypoint map, and further
outputs a 2D confidence map to weight keypoint contributions during
re-projection error minimization. By modeling the weighted sum of squared
re-projection errors probabilistically, the network effectively transmits 3D
supervision to its 2D keypoint predictions, enabling end-to-end learning.
Experiments on both simulated and real-world platforms demonstrate that KGN-Pro
outperforms existing methods in terms of grasp cover rate and success rate.
♻ ☆ MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization
Reinforcement learning (RL) algorithms aim to balance exploiting the current
best strategy with exploring new options that could lead to higher rewards.
Most common RL algorithms use undirected exploration, i.e., select random
sequences of actions. Exploration can also be directed using intrinsic rewards,
such as curiosity or model epistemic uncertainty. However, effectively
balancing task and intrinsic rewards is challenging and often task-dependent.
In this work, we introduce a framework, MaxInfoRL, for balancing intrinsic and
extrinsic exploration. MaxInfoRL steers exploration towards informative
transitions, by maximizing intrinsic rewards such as the information gain about
the underlying task. When combined with Boltzmann exploration, this approach
naturally trades off maximization of the value function with that of the
entropy over states, rewards, and actions. We show that our approach achieves
sublinear regret in the simplified setting of multi-armed bandits. We then
apply this general formulation to a variety of off-policy model-free RL methods
for continuous state-action spaces, yielding novel algorithms that achieve
superior performance across hard exploration problems and complex scenarios
such as visual control tasks.
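The balance MaxInfoRL strikes can be miniaturized to a single action-selection step: add an information-gain style bonus (approximated here by the disagreement of a model ensemble) to the task value, then sample actions with Boltzmann exploration. All quantities in the sketch are synthetic illustrations, not the paper's estimator.

```python
import numpy as np

# Toy sketch: intrinsic bonus from ensemble disagreement plus Boltzmann
# exploration. Ensemble predictions and Q-values are synthetic.
rng = np.random.default_rng(1)
q_task = np.array([1.0, 1.2, 0.9])            # extrinsic Q-value per action
ensemble_preds = rng.normal(size=(5, 3))      # 5 dynamics models x 3 actions

bonus = ensemble_preds.std(axis=0)            # epistemic disagreement
q_total = q_task + 0.5 * bonus                # intrinsic weight is assumed
temp = 0.3
probs = np.exp(q_total / temp) / np.exp(q_total / temp).sum()
action = rng.choice(3, p=probs)               # Boltzmann exploration
print(q_total, probs, action)
```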
♻ ☆ Line-Search Filter Differential Dynamic Programming for Optimal Control with Nonlinear Equality Constraints
We present FilterDDP, a differential dynamic programming algorithm for
solving discrete-time, optimal control problems (OCPs) with nonlinear equality
constraints. Unlike prior methods based on merit functions or the augmented
Lagrangian class of algorithms, FilterDDP uses a step filter in conjunction
with a line search to handle equality constraints. We identify two important
design choices for the step filter criteria which lead to robust numerical
performance: 1) we use the Lagrangian instead of the cost as one of the filter
criteria and, 2) for the stopping criteria and backward pass Hessians, we
replace the value function gradient with an estimated dual variable of the
dynamics constraints. Both choices are rigorously justified, for 2) in
particular by a formal proof of local quadratic convergence. We validate
FilterDDP on three contact-implicit trajectory optimisation problems which
arise in robotics.
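A step filter of this kind accepts a trial step only if it is not dominated by any previous iterate in the (constraint violation, Lagrangian) plane, in the spirit of the Fletcher-Leyffer filter. The sketch below shows that acceptance test with the Lagrangian substituted for the cost, per design choice 1); the margins and the bookkeeping of the full algorithm are simplified assumptions.

```python
# Hedged sketch of a step-filter acceptance test for equality-constrained DDP.
def filter_accepts(filter_set, violation, lagrangian, margin=1e-6):
    # Accept if the trial improves violation OR the Lagrangian against
    # every stored filter entry (i.e., it is not dominated).
    return all(violation < v - margin or lagrangian < l - margin
               for (v, l) in filter_set)

filter_set = [(0.5, 10.0), (0.2, 12.0)]       # (violation, Lagrangian) history
print(filter_accepts(filter_set, 0.1, 11.0))  # True: violation improved
print(filter_accepts(filter_set, 0.6, 11.5))  # False: dominated by an entry
```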
♻ ☆ CoA-VLA: Improving Vision-Language-Action Models via Visual-Textual Chain-of-Affordance
Jinming Li, Yichen Zhu, Zhibin Tang, Junjie Wen, Minjie Zhu, Xiaoyu Liu, Chengmeng Li, Ran Cheng, Yaxin Peng, Yan Peng, Feifei Feng
Robot foundation models, particularly Vision-Language-Action (VLA) models,
have garnered significant attention for their ability to enhance robot policy
learning, greatly improving robot's generalization and robustness. OpenAI's
recent model, O1, showcased impressive capabilities in solving complex problems
by utilizing extensive reasoning chains. This prompts an important question:
can robot models achieve better performance in multi-task, complex
environments by reviewing prior observations and then providing task-specific
reasoning to guide action prediction? In this paper, we introduce
Chain-of-Affordance (CoA-VLA), a novel approach to scaling robot models by
incorporating reasoning in the format of sequential robot affordances to
facilitate task completion. Specifically, we prompt the model to consider the
following four types of affordances before taking action: (1) object affordance
- what object to manipulate and where it is; (2) grasp affordance - the
specific object part to grasp; (3) spatial affordance - the optimal space to
place the object; and (4) movement affordance - the collision-free path for
movement. We further transform each affordance into two prompting formats:
visual affordance and textual affordance. We introduce a novel vision-language
co-injection module that integrates this knowledge into the policy network.
This allows the robot to leverage essential contextual information during
action inference, resulting in improved precision and robustness. Our
experiments demonstrate that CoA-VLA outperforms state-of-the-art robot
foundation models, including OpenVLA and Octo, on a variety of tasks.
Furthermore, CoA-VLA exhibits strong generalization capabilities, including
recognizing unseen object poses, identifying free space, and avoiding obstacles
in novel environments.
comment: Project webpage is available at https://chain-of-affordance.github.io
♻ ★ AKF-LIO: LiDAR-Inertial Odometry with Gaussian Map by Adaptive Kalman Filter IROS 2025
Existing LiDAR-Inertial Odometry (LIO) systems typically use sensor-specific
or environment-dependent measurement covariances during state estimation,
leading to laborious parameter tuning and suboptimal performance in challenging
conditions (e.g., sensor degeneracy and noisy observations). Therefore, we
propose an Adaptive Kalman Filter (AKF) framework that dynamically estimates
time-varying noise covariances of LiDAR and Inertial Measurement Unit (IMU)
measurements, enabling context-aware confidence weighting between sensors.
During LiDAR degeneracy, the system prioritizes IMU data while suppressing
contributions from unreliable inputs like moving objects or noisy point clouds.
Furthermore, a compact Gaussian-based map representation is introduced to model
environmental planarity and spatial noise. A correlated registration strategy
ensures accurate plane normal estimation via pseudo-merge, even in unstructured
environments like forests. Extensive experiments validate the robustness of the
proposed system across diverse environments, including dynamic scenes and
geometrically degraded scenarios. Our method achieves reliable localization
results across all MARS-LVIG sequences and ranks 8th on the KITTI Odometry
Benchmark. The code will be released at https://github.com/xpxie/AKF-LIO.git.
comment: Submitted to IROS 2025 Conference,
https://github.com/xpxie/AKF-LIO.git
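For intuition, a standard innovation-based recipe for adapting a measurement covariance online, of the family the abstract's AKF belongs to; the paper's exact estimator, windowing, and degeneracy handling are not reproduced here.

```python
import numpy as np

def adapt_measurement_cov(R, innovations, H, P, alpha=0.95):
    """Re-estimate R from a window of innovations nu, using
    E[nu nu^T] ~= H P H^T + R, then blend with the previous estimate.
    `alpha` is an illustrative smoothing factor."""
    nu = np.atleast_2d(innovations)              # rows = innovation vectors
    C = nu.T @ nu / nu.shape[0]                  # empirical innovation cov
    R_est = C - H @ P @ H.T                      # subtract predicted part
    R_est = 0.5 * (R_est + R_est.T) + 1e-9 * np.eye(R.shape[0])  # regularize
    return alpha * R + (1.0 - alpha) * R_est
```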
♻ ☆ Allocation for Omnidirectional Aerial Robots: Incorporating Power Dynamics
Tilt-rotor aerial robots are more dynamic and versatile than fixed-rotor
platforms, since the thrust vector and body orientation are decoupled. However,
the coordination of servos and propellers (the allocation problem) is not
trivial, especially accounting for overactuation and actuator dynamics. We
incrementally build and present three novel allocation methods for tilt-rotor
aerial robots, comparing them to state-of-the-art methods on a real system
performing dynamic maneuvers. We extend the state-of-the-art geometric
allocation into a differential allocation, which uses the platform's redundancy
and does not suffer from singularities. We expand it by incorporating actuator
dynamics and propeller power dynamics. These allow us to model dynamic
propeller acceleration limits, bringing two main advantages: balancing
propeller speed without the need for nullspace goals, and allowing the platform
to selectively turn off propellers during flight, opening the door to new
manipulation possibilities. We also use actuator dynamics and limits to
normalize the allocation problem, making it easier to tune and allowing it to
track 70% faster trajectories than a geometric allocation.
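A minimal sketch of the differential-allocation idea (not the authors' formulation): rather than inverting the geometric allocation map, command actuator rates through a damped pseudo-inverse of the allocation Jacobian, which exploits redundancy and stays well-behaved near singular tilt configurations.

```python
import numpy as np

def differential_allocation(J, wrench_err, damping=1e-3):
    """One step of damped least-squares differential allocation:
    J maps actuator rates to wrench rates; solve for rates that drive
    the wrench error to zero. `damping` is illustrative."""
    JT = J.T
    return JT @ np.linalg.solve(J @ JT + damping * np.eye(J.shape[0]),
                                wrench_err)

# Usage: 6D wrench error, 8 actuators (e.g. 4 propellers + 4 tilt servos).
J = np.random.randn(6, 8)                      # placeholder Jacobian
u_dot = differential_allocation(J, np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0]))
```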
♻ ☆ UniLGL: Learning Uniform Place Recognition for FOV-limited/Panoramic LiDAR Global Localization
Existing LiDAR global localization (LGL) methods typically consider only
partial information (e.g., geometric features) from LiDAR observations or are
designed for homogeneous LiDAR sensors, overlooking uniformity in LGL. In this
work, a uniform LGL
method is proposed, termed UniLGL, which simultaneously achieves spatial and
material uniformity, as well as sensor-type uniformity. The key idea of the
proposed method is to encode the complete point cloud, which contains both
geometric and material information, into a pair of BEV images (i.e., a spatial
BEV image and an intensity BEV image). An end-to-end multi-BEV fusion network
is designed to extract uniform features, equipping UniLGL with spatial and
material uniformity. To ensure robust LGL across heterogeneous LiDAR sensors, a
viewpoint invariance hypothesis is introduced, which replaces the conventional
translation equivariance assumption commonly used in existing LiDAR place
recognition (LPR) networks and
supervises UniLGL to achieve sensor-type uniformity in both global descriptors
and local feature representations. Finally, based on the mapping between local
features on the 2D BEV image and the point cloud, a robust global pose
estimator is derived that recovers the globally optimal pose on
SE(3) without requiring additional registration. To validate the effectiveness
of the proposed uniform LGL, extensive benchmarks are conducted in real-world
environments, and the results show that the proposed UniLGL is demonstrably
competitive with other state-of-the-art LGL methods. Furthermore, UniLGL
has been deployed on diverse platforms, including full-size trucks and agile
Micro Aerial Vehicles (MAVs), to enable high-precision localization and mapping
as well as multi-MAV collaborative exploration in port and forest environments,
demonstrating the applicability of UniLGL in industrial and field scenarios.
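A toy version of the paired-BEV encoding at the heart of UniLGL, under assumed parameters (cell resolution, image size): a spatial image keeps the maximum height per cell and an intensity image the mean reflectivity.

```python
import numpy as np

def point_cloud_to_bev(points, intensities, res=0.4, size=128):
    """points: (N, 3) xyz; intensities: (N,). Returns a spatial BEV
    (max height per cell; empty cells default to 0) and an intensity
    BEV (mean reflectivity per cell)."""
    half = size * res / 2.0
    ix = ((points[:, 0] + half) / res).astype(int)
    iy = ((points[:, 1] + half) / res).astype(int)
    keep = (ix >= 0) & (ix < size) & (iy >= 0) & (iy < size)
    ix, iy = ix[keep], iy[keep]
    spatial = np.zeros((size, size))
    inten = np.zeros((size, size))
    count = np.zeros((size, size))
    np.maximum.at(spatial, (ix, iy), points[keep, 2])
    np.add.at(inten, (ix, iy), intensities[keep])
    np.add.at(count, (ix, iy), 1.0)
    inten = np.divide(inten, count, out=np.zeros_like(inten),
                      where=count > 0)
    return spatial, inten
```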
♻ ☆ Exploiting Local Observations for Robust Robot Learning
Wenshuai Zhao, Eetu-Aleksi Rantala, Sahar Salimpour, Zhiyuan Li, Joni Pajarinen, Jorge Peña Queralta
While many robotic tasks can be addressed using either centralized
single-agent control with full state observation or decentralized multi-agent
control, clear criteria for choosing between these approaches remain
underexplored. This paper systematically investigates how multi-agent
reinforcement learning (MARL) with local observations can improve robustness in
complex robotic systems compared to traditional centralized control. Through
theoretical analysis and empirical validation, we show that in certain tasks,
decentralized MARL can achieve performance comparable to centralized methods
while exhibiting greater resilience to perturbations and agent failures. By
analytically demonstrating the equivalence of single-agent reinforcement
learning (SARL) and MARL under full observability, we identify observability as
the critical factor distinguishing the two paradigms. We further derive bounds
quantifying performance degradation under external perturbations for locally
observable policies. Empirical results on standard MARL benchmarks confirm that
MARL with limited observations can maintain competitive performance. Finally,
real-world experiments with a mobile manipulator demonstrate that decentralized
MARL controllers achieve markedly improved robustness to agent malfunctions and
environmental disturbances relative to centralized baselines. Together, these
findings highlight MARL with local observations as a robust and practical
alternative to conventional centralized control in complex robotic systems.
comment: 8 pages, 8 figures
♻ ☆ Estimating Scene Flow in Robot Surroundings with Distributed Miniaturized Time-of-Flight Sensors
Tracking motions of humans or objects in the surroundings of the robot is
essential to improve safe robot motions and reactions. In this work, we present
an approach for scene flow estimation from low-density and noisy point clouds
acquired from miniaturized Time of Flight (ToF) sensors distributed on the
robot body. The proposed method clusters points from consecutive frames and
applies Iterative Closest Point (ICP) to estimate a dense motion flow, with
additional steps introduced to mitigate the impact of sensor noise and
low-density data points. Specifically, we employ a fitness-based classification
to distinguish between stationary and moving points and an inlier removal
strategy to refine geometric correspondences. The proposed approach is
validated in an experimental setup where 24 ToF sensors are used to estimate the
velocity of an object moving at different controlled speeds. Experimental
results show that the method consistently approximates the direction and
magnitude of the motion with an error in line with the sensor noise.
comment: 7 pages, 5 figures, 2 tables, 1 algorithm, IEEE RO-MAN 2025 accepted
paper
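To make the pipeline concrete, a stripped-down sketch of the per-cluster step (assuming correspondences are already known, i.e. the inner step of ICP), with a noise-floor gate standing in for the paper's fitness-based stationary/moving classification:

```python
import numpy as np

def rigid_align(P, Q):
    """Kabsch alignment of corresponding points P -> Q (ICP's inner
    least-squares step once correspondences are fixed)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cP).T @ (Q - cQ))
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def cluster_velocity(P, Q, dt, noise_floor=0.02):
    """Velocity of a cluster between frames; flag it as moving only if
    the mean displacement exceeds an assumed sensor-noise floor."""
    R, t = rigid_align(P, Q)
    disp = (P @ R.T + t) - P
    v = disp.mean(axis=0) / dt
    moving = np.linalg.norm(disp.mean(axis=0)) > noise_floor
    return v, moving
```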
♻ ☆ EP-Diffuser: An Efficient Diffusion Model for Traffic Scene Generation and Prediction via Polynomial Representations
As the prediction horizon increases, predicting the future evolution of
traffic scenes becomes increasingly difficult due to the multi-modal nature of
agent motion. Most state-of-the-art (SotA) prediction models primarily focus on
forecasting the most likely future. However, for the safe operation of
autonomous vehicles, it is equally important to cover the distribution of
plausible motion alternatives. To address this, we introduce EP-Diffuser, a
novel parameter-efficient diffusion-based generative model designed to capture
the distribution of possible traffic scene evolutions. Conditioned on road
layout and agent history, our model acts as a predictor and generates diverse,
plausible scene continuations. We benchmark EP-Diffuser against two SotA models
in terms of accuracy and plausibility of predictions on the Argoverse 2
dataset. Despite its significantly smaller model size, our approach achieves
both highly accurate and plausible traffic scene predictions. We further
evaluate model generalization ability in an out-of-distribution (OoD) test
setting using the Waymo Open dataset and show the superior robustness of our
approach.
The code and model checkpoints are available at:
https://github.com/continental/EP-Diffuser.
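As a hedged illustration of the polynomial representation in the title (degree and parameterization assumed, not taken from the paper), a trajectory can be compressed into a handful of per-axis coefficients and decoded back:

```python
import numpy as np

def to_poly_coeffs(traj_xy, degree=5):
    """Least-squares fit of per-axis polynomials over normalized time;
    returns (degree + 1, 2) coefficients for a (T, 2) trajectory."""
    t = np.linspace(0.0, 1.0, traj_xy.shape[0])
    V = np.vander(t, degree + 1, increasing=True)   # [1, t, t^2, ...]
    coeffs, *_ = np.linalg.lstsq(V, traj_xy, rcond=None)
    return coeffs

def from_poly_coeffs(coeffs, n_steps):
    """Decode coefficients back into an (n_steps, 2) trajectory."""
    t = np.linspace(0.0, 1.0, n_steps)
    return np.vander(t, coeffs.shape[0], increasing=True) @ coeffs
```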
♻ ★ Tiny LiDARs for Manipulator Self-Awareness: Sensor Characterization and Initial Localization Experiments
For several tasks, ranging from manipulation to inspection, it is beneficial
for robots to localize a target object in their surroundings. In this paper, we
propose an approach that utilizes coarse point clouds obtained from
miniaturized VL53L5CX Time-of-Flight (ToF) sensors (tiny LiDARs) to localize a
target object in the robot's workspace. We first conduct an experimental
campaign to calibrate the dependency of sensor readings on relative range and
orientation to targets. We then propose a probabilistic sensor model, which we
validate in an object pose estimation task using a Particle Filter (PF). The
results show that the proposed sensor model improves the performance of the
localization of the target object with respect to two baselines: one that
assumes measurements are free from uncertainty and one in which the confidence
is provided by the sensor datasheet.
comment: 7 pages, 6 figures, 3 tables, IEEE/RSJ International Conference on
Intelligent Robots and Systems 2025 accepted paper
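A minimal sketch of how such a calibrated sensor model enters a particle filter: the measurement noise is made a function of range and incidence angle (the two dependencies the calibration campaign characterizes), and particle weights follow from a Gaussian likelihood. All coefficients below are placeholders, not the paper's calibration fit.

```python
import numpy as np

def particle_weights(expected_ranges, measured, incidence_rad):
    """expected_ranges: per-particle predicted range (m); measured:
    the ToF reading (m). Noise grows with range and incidence angle."""
    sigma = 0.01 + 0.02 * expected_ranges \
        + 0.05 * np.abs(np.sin(incidence_rad))
    w = np.exp(-0.5 * ((measured - expected_ranges) / sigma) ** 2) / sigma
    return w / w.sum()
```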
♻ ☆ SDHN: Skewness-Driven Hypergraph Networks for Enhanced Localized Multi-Robot Coordination
Multi-Agent Reinforcement Learning is widely used for multi-robot
coordination, where simple graphs typically model pairwise interactions.
However, such representations fail to capture higher-order collaborations,
limiting effectiveness in complex tasks. While hypergraph-based approaches
enhance cooperation, existing methods often generate arbitrary hypergraph
structures and lack adaptability to environmental uncertainties. To address
these challenges, we propose the Skewness-Driven Hypergraph Network (SDHN),
which employs stochastic Bernoulli hyperedges to explicitly model higher-order
multi-robot interactions. By introducing a skewness loss, SDHN promotes an
efficient, small-hyperedge-dominant hypergraph structure, allowing robots
to prioritize localized synchronization while still attending to global
information, similar to human coordination. Extensive experiments on Moving
Agents in Formation and Robotic Warehouse tasks validate SDHN's effectiveness,
demonstrating superior performance over state-of-the-art baselines.
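A plain-numpy illustration of a skewness-style regularizer of the kind the abstract describes (not the paper's exact loss): expected hyperedge sizes come from the Bernoulli membership probabilities, and the loss rewards a right-skewed size distribution, i.e. many small hyperedges and few large ones.

```python
import numpy as np

def skewness_loss(incidence_probs):
    """incidence_probs: (n_robots, n_hyperedges) Bernoulli membership
    probabilities. Minimizing the negative sample skewness of the
    expected hyperedge sizes promotes a small-hyperedge-dominant
    structure."""
    sizes = incidence_probs.sum(axis=0)            # expected size per edge
    mu, sd = sizes.mean(), sizes.std() + 1e-8
    skew = ((sizes - mu) ** 3).mean() / sd ** 3    # sample skewness
    return -skew
```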
♻ ☆ OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation
Yunpeng Gao, Chenhui Li, Zhongrui You, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhonghan Tang, Liansheng Wang, Penghui Yang, Yiwen Tang, Yuhang Tang, Shuai Liang, Songyi Zhu, Ziqin Xiong, Yifei Su, Xinyi Ye, Jianan Li, Yan Ding, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li
Vision-Language Navigation (VLN) aims to guide agents by leveraging language
instructions and visual cues, playing a pivotal role in embodied AI. Indoor VLN
has been extensively studied, whereas outdoor aerial VLN remains underexplored.
The potential reason is that outdoor aerial view encompasses vast areas, making
data collection more challenging, which results in a lack of benchmarks. To
address this problem, we propose OpenFly, a platform comprising various
rendering engines, a versatile toolchain, and a large-scale benchmark for
aerial VLN. Firstly, we integrate diverse rendering engines and advanced
techniques for environment simulation, including Unreal Engine, GTA V, Google
Earth, and 3D Gaussian Splatting (3D GS). Particularly, 3D GS supports
real-to-sim rendering, further enhancing the realism of our environments.
Secondly, we develop a highly automated toolchain for aerial VLN data
collection, streamlining point cloud acquisition, scene semantic segmentation,
flight trajectory creation, and instruction generation. Thirdly, based on the
toolchain, we construct a large-scale aerial VLN dataset with 100k
trajectories, covering diverse heights and lengths across 18 scenes. Moreover,
we propose OpenFly-Agent, a keyframe-aware VLN model emphasizing key
observations during flight. For benchmarking, extensive experiments and
analyses are conducted, evaluating several recent VLN methods and showcasing
the superiority of our OpenFly platform and agent. The toolchain, dataset, and
code will be open-sourced.
comment: 20 pages, 11 figures
♻ ☆ LoL-NMPC: Low-Level Dynamics Integration in Nonlinear Model Predictive Control for Unmanned Aerial Vehicles IROS 2025
In this paper, we address the problem of tracking high-speed agile
trajectories for Unmanned Aerial Vehicles (UAVs), where model inaccuracies can
lead to large tracking errors. Existing Nonlinear Model Predictive Controller
(NMPC) methods typically neglect the dynamics of the low-level flight
controllers, such as the underlying PID controllers present in many flight
stacks, which results in sub-optimal tracking performance at high
speeds and accelerations. To this end, we propose a novel NMPC formulation,
LoL-NMPC, which explicitly incorporates low-level controller dynamics and motor
dynamics in order to minimize trajectory tracking errors while maintaining
computational efficiency. By leveraging linear constraints inside low-level
dynamics, our approach inherently accounts for actuator constraints without
requiring additional reallocation strategies. The proposed method is validated
in both simulation and real-world experiments, demonstrating improved tracking
accuracy and robustness at speeds up to 98.57 km/h and accelerations of 3.5 g.
Our results show an average 21.97% reduction in trajectory tracking error over
a standard NMPC formulation, with LoL-NMPC maintaining real-time feasibility at
100 Hz on an embedded ARM-based flight computer.
comment: Accepted to IROS 2025
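A toy version of the augmentation idea (time constants and the stand-in kinematics are assumed, not the paper's model): the NMPC prediction model carries the low-level loop's achieved rate and the motor thrust as extra states, each lagging its command through a first-order model instead of being treated as instantaneous.

```python
def augmented_step(p, v_act, f_act, v_cmd, f_cmd, dt,
                   k_inner=8.0, tau_motor=0.05):
    """One Euler step of the augmented prediction model."""
    v_act += dt * k_inner * (v_cmd - v_act)      # inner-loop (PID) lag
    f_act += dt * (f_cmd - f_act) / tau_motor    # motor spin-up lag
    p += dt * v_act                              # stand-in kinematics
    return p, v_act, f_act
```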
♻ ☆ KineDepth: Utilizing Robot Kinematics for Online Metric Depth Estimation
Depth perception is essential for a robot's spatial and geometric
understanding of its environment, with many tasks traditionally relying on
hardware-based depth sensors like RGB-D or stereo cameras. However, these
sensors face practical limitations, including issues with transparent and
reflective objects, high costs, calibration complexity, spatial and energy
constraints, and increased failure rates in compound systems. While monocular
depth estimation methods offer a cost-effective and simpler alternative, their
adoption in robotics is limited due to their output of relative rather than
metric depth, which is crucial for robotics applications. In this paper, we
propose a method that utilizes a single calibrated camera, enabling the robot
to act as a "measuring stick" to convert relative depth estimates into metric
depth in real-time as tasks are performed. Our approach employs an LSTM-based
metric depth regressor, trained online and refined through probabilistic
filtering, to accurately restore the metric depth across the monocular depth
map, particularly in areas proximal to the robot's motion. Experiments with
real robots demonstrate that our method significantly outperforms current
state-of-the-art monocular metric depth estimation techniques, achieving a
22.1% reduction in depth error and a 52% increase in success rate for a
downstream task.
comment: 8 pages, 5 figures
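The "measuring stick" idea in its simplest form, as a hedged sketch: at pixels where the robot's own body is visible, forward kinematics provides metric depth, and a scale/shift fit maps the relative depth map to metric units. The paper's LSTM regressor with probabilistic filtering generalizes this to a spatially varying, online estimate.

```python
import numpy as np

def fit_metric_scale(rel_samples, metric_samples):
    """Least-squares fit of s, b so that s * d_rel + b ~= d_metric."""
    A = np.stack([rel_samples, np.ones_like(rel_samples)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, metric_samples, rcond=None)
    return s, b

d_rel = np.array([0.2, 0.4, 0.7])      # relative depth at robot pixels
d_kin = np.array([0.55, 0.95, 1.55])   # metric depth from kinematics
s, b = fit_metric_scale(d_rel, d_kin)  # metric map ~= s * rel_map + b
```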
♻ ☆ Humanoids in Hospitals: A Technical Study of Humanoid Robot Surrogates for Dexterous Medical Interventions
Soofiyan Atar, Xiao Liang, Calvin Joyce, Florian Richter, Wood Ricardo, Charles Goldberg, Preetham Suresh, Michael Yip
The increasing demand for healthcare workers, driven by aging populations and
labor shortages, presents a significant challenge for hospitals. Humanoid
robots have the potential to alleviate these pressures by leveraging their
human-like dexterity and adaptability to assist in medical procedures. This
work presents an exploratory study on the feasibility of humanoid robots
performing direct clinical tasks through teleoperation. A bimanual
teleoperation system was developed for the Unitree G1 Humanoid Robot,
integrating high-fidelity pose tracking, custom grasping configurations, and an
impedance controller to safely and precisely manipulate medical tools. The
system is evaluated in seven diverse medical procedures, including physical
examinations, emergency interventions, and precision needle tasks. Our results
demonstrate that humanoid robots can successfully replicate critical aspects of
human medical assessments and interventions, with promising quantitative
performance in ventilation and ultrasound-guided tasks. However, challenges
remain, including limitations in force output for procedures requiring high
strength and sensor sensitivity issues affecting clinical accuracy. This study
highlights the potential and current limitations of humanoid robots in hospital
settings and lays the groundwork for future research on robotic healthcare
integration.
comment: 8 pages
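For reference, the textbook Cartesian impedance law of the kind such a controller builds on (gains below are illustrative, not the paper's tuning): the commanded wrench is proportional to pose and velocity errors, keeping tool interaction compliant.

```python
import numpy as np

def impedance_wrench(x, x_des, xd, xd_des, K, D):
    """Commanded Cartesian force from stiffness K and damping D."""
    return K @ (x_des - x) + D @ (xd_des - xd)

K = np.diag([300.0, 300.0, 200.0])  # N/m, softer along the tool axis
D = np.diag([30.0, 30.0, 20.0])     # N*s/m
f = impedance_wrench(np.zeros(3), np.array([0.0, 0.0, 0.01]),
                     np.zeros(3), np.zeros(3), K, D)
```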
♻ ☆ ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning
Reinforcement learning (RL) is ubiquitous in the development of modern AI
systems. However, state-of-the-art RL agents require extensive, and potentially
unsafe, interactions with their environments to learn effectively. These
limitations confine RL agents to simulated environments, hindering their
ability to learn directly in real-world settings. In this work, we present
ActSafe, a novel model-based RL algorithm for safe and efficient exploration.
ActSafe learns a well-calibrated probabilistic model of the system and plans
optimistically w.r.t. the epistemic uncertainty about the unknown dynamics,
while enforcing pessimism w.r.t. the safety constraints. Under regularity
assumptions on the constraints and dynamics, we show that ActSafe guarantees
safety during learning while also obtaining a near-optimal policy in finite
time. In addition, we propose a practical variant of ActSafe that builds on
the latest model-based RL advances and enables safe exploration even in
high-dimensional settings such as visual control. We empirically show that
ActSafe obtains state-of-the-art performance in difficult exploration tasks on
standard safe deep RL benchmarks while ensuring safety during learning.
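The optimism/pessimism split can be written in two lines, as a hedged sketch (the paper's planner, model, and constants are not reproduced): plans are ranked by an optimistic bound on return, but only those whose pessimistic bound on the safety cost stays within budget are admissible.

```python
def plan_bounds(mean_ret, std_ret, mean_cost, std_cost, beta=1.0):
    """Optimistic return bound (drives exploration) and pessimistic
    cost bound (enforces safety); beta scales epistemic uncertainty."""
    return mean_ret + beta * std_ret, mean_cost + beta * std_cost

ret_ub, cost_ub = plan_bounds(5.0, 1.2, 0.3, 0.2)
admissible = cost_ub <= 1.0   # assumed safety budget
```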
♻ ☆ Controllable Traffic Simulation through LLM-Guided Hierarchical Reasoning and Refinement IROS 2025
Evaluating autonomous driving systems in complex and diverse traffic
scenarios through controllable simulation is essential to ensure their safety
and reliability. However, existing traffic simulation methods face challenges
in their controllability. To address this, we propose a novel diffusion-based
and LLM-enhanced traffic simulation framework. Our approach incorporates a
high-level understanding module and a low-level refinement module, which
systematically examines the hierarchical structure of traffic elements, guides
LLMs to thoroughly analyze traffic scenario descriptions step by step, and
refines the generation by self-reflection, enhancing their understanding of
complex situations. Furthermore, we propose a Frenet-frame-based cost function
framework that provides LLMs with geometrically meaningful quantities,
improving their grasp of spatial relationships in a scenario and enabling more
accurate cost function generation. Experiments on the Waymo Open Motion Dataset
(WOMD) demonstrate that our method can handle more intricate descriptions and
generate a broader range of scenarios in a controllable manner.
comment: Accepted by IROS 2025
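As an illustration of the geometrically meaningful quantities involved, a minimal Cartesian-to-Frenet projection against a polyline reference path (a sketch under assumed conventions; the paper's implementation may differ): arc length s along the path and signed lateral offset d.

```python
import numpy as np

def to_frenet(ref_xy, point_xy):
    """Project point_xy (2,) onto the polyline ref_xy (M, 2); return
    arc length s and signed lateral offset d."""
    seg = np.diff(ref_xy, axis=0)
    seg_len = np.linalg.norm(seg, axis=1)
    cum_s = np.concatenate([[0.0], np.cumsum(seg_len)])
    rel = point_xy - ref_xy[:-1]
    t = np.clip((rel * seg).sum(axis=1) / seg_len ** 2, 0.0, 1.0)
    proj = ref_xy[:-1] + t[:, None] * seg
    dists = np.linalg.norm(point_xy - proj, axis=1)
    i = int(np.argmin(dists))
    s = cum_s[i] + t[i] * seg_len[i]
    r = point_xy - proj[i]
    side = np.sign(seg[i, 0] * r[1] - seg[i, 1] * r[0])
    return s, side * dists[i]
```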
♻ ☆ Model-Free and Real-Time Unicycle-Based Source Seeking with Differential Wheeled Robotic Experiments
Many autonomous robots aimed at source seeking are studied, and their
controllers designed, using unicycle modeling and formulations. This is true not
only for model-based controllers, but also for model-free, real-time control
methods such as extremum seeking control (ESC). In this paper, we propose a
unicycle-based ESC design applicable to differential wheeled robots that: (1)
has a very simple design, based on one simple control-affine law and without
state integrators; (2) attenuates the oscillations known to persist in ESC
designs (i.e., the robot fully stops at the source); and (3) operates in a
model-free, real-time
setting, tolerating environmental/sensor noise. We provide simulation and
real-world robotic experimental results for fixed and moving light source
seeking by a differential wheeled robot using our proposed design. Results
indicate clear advantages of our proposed design when compared to the
literature, including attenuation of undesired oscillations, improved
convergence speed, and better handling of noise.
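For contrast with the proposed design, a classical perturbation-based unicycle ESC step of the Cochran-Krstić family (constants illustrative): constant forward speed with a sinusoidal heading dither modulated by the sensed signal J. The paper's contribution is a simpler control-affine law that, unlike this baseline, attenuates the dither so the robot fully stops at the source.

```python
import numpy as np

def classic_esc_step(pose, J, t, dt, v=0.1, omega=5.0, c=2.0):
    """One Euler step: dither-modulated heading rate, constant speed."""
    x, y, th = pose
    th_dot = omega * np.cos(omega * t) + c * J * np.sin(omega * t)
    return np.array([x + dt * v * np.cos(th),
                     y + dt * v * np.sin(th),
                     th + dt * th_dot])
```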