SEBVS: Synthetic Event-based Visual Servoing for Robot Navigation and Manipulation

(Paper under review)

¹Arizona State University  ²Indian Institute of Information Technology, Sri City, Chittoor


Abstract

Event cameras have emerged as a powerful sensing modality for robotics, offering microsecond latency, high dynamic range, and low power consumption. These characteristics make them well-suited for real-time robotic perception in scenarios affected by motion blur, occlusion, and extreme changes in illumination. Despite this potential, event-based vision, particularly through video-to-event (v2e) simulation, remains underutilized in mainstream robotics simulators, limiting the advancement of event-driven solutions for navigation and manipulation. This work presents an open-source, user-friendly v2e Robot Operating System (ROS) package for Gazebo simulation that enables seamless event stream generation from RGB camera feeds. The package is used to investigate event-based robotic policies (ERP) for real-time navigation and manipulation. Two representative scenarios are evaluated: (1) object following with a mobile robot and (2) object detection and grasping with a robotic manipulator. Transformer-based ERPs are trained by behavior cloning and compared to RGB-based counterparts under various operating conditions. Experimental results show that event-based policies consistently deliver competitive and often superior robustness in high-speed or visually challenging environments. These results highlight the potential of event-driven perception to improve real-time robotic navigation and manipulation, providing a foundation for broader integration of event cameras into robotic policy learning.
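The core idea behind v2e-style simulation is to synthesize events from successive RGB frames: a pixel fires an ON or OFF event when its log-intensity changes by more than a contrast threshold. The sketch below is a deliberately minimal illustration of that principle, not the actual package code — the real v2e pipeline additionally models per-pixel reference levels, timestamps, bandwidth, and sensor noise, and the function name and threshold value here are assumptions.

```python
import numpy as np

def events_from_frames(prev_gray, curr_gray, threshold=0.2, eps=1e-3):
    """Illustrative v2e-style event generation from two grayscale frames.

    A pixel whose log-intensity change exceeds `threshold` fires an event:
    +1 (ON) for brightening, -1 (OFF) for darkening, 0 otherwise.
    """
    log_prev = np.log(prev_gray.astype(np.float64) + eps)
    log_curr = np.log(curr_gray.astype(np.float64) + eps)
    diff = log_curr - log_prev
    pol = np.zeros_like(diff, dtype=np.int8)
    pol[diff > threshold] = 1     # ON events
    pol[diff < -threshold] = -1   # OFF events
    return pol
```

Splitting the returned polarity map into its positive and negative parts yields the E⁺/E⁻ channels that the policies described below consume.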


ERP Architecture


ERPNav and ERPArm share a lightweight early-fusion transformer. We render event polarity maps (E⁺/E⁻) from the v2e ROS2 node over a short window and stack them with RGB to form a 5-channel tensor. After per-channel normalization, a conv patch-embedding splits the image into tokens; positional encodings are added and the tokens pass through a few self-attention + MLP blocks. A pooled [CLS] token feeds a small policy head: ERPNav regresses (v, ω) for differential drive control, while ERPArm regresses a 6-DoF pre-grasp pose (x, y, z, roll, pitch, yaw). Training is behavior cloning from expert demonstrations (navigation: detector-assisted PID producing cmd_vel; manipulation: MoveIt pre-grasp waypoints), using L1/L2 losses with light action smoothing. The design is real-time, requires no depth/flow, and keeps parameters small enough for on-robot inference while retaining the benefits of event-guided perception under fast motion and extreme lighting.
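The early-fusion design above can be sketched in a few dozen lines of PyTorch. This is a minimal illustration of the described pipeline — stack E⁺/E⁻ with RGB into a 5-channel tensor, conv patch embedding, positional encodings, a few self-attention blocks, pooled [CLS] token into a small regression head. The image size, patch size, embedding width, and depth below are placeholder assumptions, not the paper's hyperparameters; `action_dim=2` corresponds to ERPNav's (v, ω), and `action_dim=6` would give ERPArm's pre-grasp pose.

```python
import torch
import torch.nn as nn

class EarlyFusionPolicy(nn.Module):
    """Sketch of a 5-channel (RGB + E+/E-) early-fusion transformer policy."""

    def __init__(self, img_size=128, patch=16, dim=128, depth=4,
                 heads=4, action_dim=2):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Conv patch embedding over RGB (3) + event polarity maps E+/E- (2)
        self.embed = nn.Conv2d(5, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Policy head: (v, w) for ERPNav, or a 6-DoF pre-grasp pose for ERPArm
        self.head = nn.Linear(dim, action_dim)

    def forward(self, x):
        # x: (B, 5, H, W), per-channel normalized
        tok = self.embed(x).flatten(2).transpose(1, 2)        # (B, N, dim)
        cls = self.cls.expand(x.shape[0], -1, -1)
        tok = torch.cat([cls, tok], dim=1) + self.pos
        return self.head(self.encoder(tok)[:, 0])             # pooled [CLS]
```

Behavior cloning then reduces to minimizing an L1/L2 loss between the head's output and the expert action (cmd_vel twists for navigation, MoveIt pre-grasp waypoints for manipulation).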



Results


We evaluated each policy variant over 15 simulated episodes per task. For ERPNav, the RGB+Event model achieved the lowest centroid tracking error (106.7 ± 26.3 px) and the highest success rate (93.3%), while maintaining an appropriate following distance, outperforming RGB-only and Event-only baselines. For ERPArm, early fusion likewise delivered the best grasp-ready pose predictions: in single-object scenes it reached 41.1 ± 9.5 mm error, 71.4% accuracy, 7.8 ± 0.6 ms latency, and 51.7% success; in multi-object scenes it achieved 52.6 ± 11.3 mm, 58.9% accuracy, 7.6 ± 0.5 ms latency, and 31.8% success. Although Event-only inference was the fastest (≈3.0–3.2 ms), it trailed in accuracy and success, confirming that RGB+Event fusion offers the best overall trade-off between precision, robustness, and responsiveness across both navigation and manipulation tasks.


Result Recordings


BibTeX

@inproceedings{vinod2025sebvs,
  title     = {SEBVS: Synthetic Event-based Visual Servoing for Robot Navigation and Manipulation},
  author    = {Vinod, Krishna and Ramesh, Prithvi Jai and B N, Pavan Kumar and Chakravarthi, Bharatesh},
  booktitle = {},
  year      = {2025}
}