eNavi: Event-based Imitation Policies for Low-Light Indoor Mobile Robot Navigation

(Under Review)

Arizona State University


eNavi Dataset

Abstract

Event cameras provide high dynamic range and microsecond-level temporal resolution, making them well-suited for indoor robot navigation, where conventional RGB cameras degrade under fast motion or low-light conditions. Despite advances in event-based perception spanning detection, SLAM, and pose estimation, there remains limited research on end-to-end control policies that exploit the asynchronous nature of event streams. To address this gap, we introduce a real-world indoor person-following dataset collected with a TurtleBot 2 robot, featuring synchronized raw event streams, RGB frames, and expert control actions across multiple indoor maps and trajectories, under both normal- and low-light conditions. We further build a multimodal data preprocessing pipeline that temporally aligns event and RGB observations while reconstructing ground-truth actions from odometry to support high-quality imitation learning. Building on this dataset, we propose a late-fusion RGB-Event navigation policy that combines dual MobileNet encoders with a transformer-based fusion module trained via behavioral cloning. A systematic evaluation of RGB-only, Event-only, and RGB-Event fusion models across 12 training variations, ranging from single-path imitation to general multi-path imitation, shows that policies incorporating event data, particularly the fusion model, achieve improved robustness and lower action prediction error, especially in unseen low-light conditions where RGB-only models fail.


eNavi Data collection


The data collection platform is built on a TurtleBot 2 mobile robot (Kobuki base), with all cameras and computing units rigidly mounted to the top plate of the chassis. Onboard processing is handled by an NVIDIA Jetson Orin Nano running the Robot Operating System 2 (ROS 2) framework, which manages sensor drivers and data logging.

The sensing suite consists of a heterogeneous camera configuration mounted on a beam-splitter setup to minimize parallax. The primary sensor is a Prophesee Metavision EVK4 event camera featuring the IMX636 sensor at a resolution of 1280 × 720, offering high temporal resolution with latency below 220 μs at 1k lux. The secondary sensor is a FLIR Spinnaker RGB camera configured to capture frames at the same 1280 × 720 resolution, facilitating approximate pixel-level correspondence. The event camera interfaces with ROS 2 via the metavision_driver, while the RGB camera uses standard V4L2 drivers. For connectivity, a TP-Link Wi-Fi dongle is attached to the Jetson Orin Nano; this wireless link enables remote development access via SSH and manual teleoperation of the robot with a joystick controller during data collection.

To encourage generalization of the learned policy, we curate the dataset across three distinct indoor environments and recruit two human subjects to introduce variability in clothing, body shape, and gait. The subjects are instructed to vary their walking speeds and trajectories to capture a broad range of motion dynamics. The full dataset comprises approximately two hours of driving data, segmented into more than 175 episodes. To facilitate benchmarking, we enforce a separation of environments: one map is used for in-distribution training and testing, while the remaining two are reserved for generalization experiments. This split allows us to evaluate performance both within seen environments and in previously unseen layouts.
We plan to release the dataset as an open-source contribution to the research community.
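The preprocessing pipeline temporally aligns event streams with RGB frames. As a rough illustration of one common alignment scheme (not necessarily the exact procedure used for eNavi), events falling within each RGB frame's time window can be binned into a 2-channel polarity count frame; the function below is a hypothetical NumPy sketch, with all names illustrative:

```python
import numpy as np

def events_to_frame(ts, xs, ys, ps, t0, t1, h, w):
    """Bin events with timestamps in [t0, t1) into a 2-channel count frame:
    channel 0 counts positive-polarity events, channel 1 negative-polarity."""
    frame = np.zeros((2, h, w), dtype=np.float32)
    m = (ts >= t0) & (ts < t1)              # keep events inside the RGB window
    ch = (ps[m] <= 0).astype(np.int64)      # 0 = positive, 1 = negative polarity
    np.add.at(frame, (ch, ys[m], xs[m]), 1.0)  # unbuffered scatter-add of counts
    return frame
```

Each RGB timestamp then defines a window [t0, t1), yielding the 2-channel event tensor paired with that frame for imitation learning.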



ENP Architecture


The ENP architecture employs a late-fusion strategy to process synchronized RGB and event-frame tensors. The model consists of two stages: (i) modality-specific encoding and (ii) attention-based fusion followed by a control head.

Two parallel MobileNetV3-Small encoders extract feature embeddings from each modality. The RGB encoder processes 3-channel RGB frames with frozen pretrained weights to preserve learned features. The event encoder processes the 2-channel event representation with trainable weights, allowing it to learn features tailored to sparse, high-temporal-resolution event data. Each encoder outputs a compact feature vector passed to the fusion module.

The feature vectors from both encoders are tokenized and concatenated into a unified sequence (two feature tokens) processed by a Transformer encoder block. This self-attention-based fusion enables the policy to adaptively weigh RGB context against event dynamics on a per-sample basis, which is important under varying illumination and motion conditions. The fused representation is fed into an MLP policy head that predicts continuous differential-drive commands: linear velocity (v) and angular velocity (ω). At inference, the policy maps each synchronized observation to an action.

To systematically study the contribution of event data under different levels of task complexity and illumination, we train a family of ENP variants spanning three architectures (RGB-only, Event-only, RGB+Event fusion) and four dataset subsets from eNavi, yielding 12 model variants.



Results


We evaluate ENP on the eNavi dataset around two hypotheses: (1) Multimodal training efficiency — fusing synchronous RGB context with asynchronous event dynamics leads to faster convergence and lower validation error compared to RGB baselines, and (2) Low-light robustness / Zero-shot generalization — event-driven policies trained exclusively on normal-light data generalize better to unseen low-light conditions than RGB-only baselines, which suffer from texture degradation in the dark.

All policy variants are implemented in PyTorch and trained using AdamW with a learning rate of 2×10⁻⁴ and weight decay of 3×10⁻⁴. We use a batch size of 64 and an 80/10/10 split for training, validation, and test sets. Early stopping with a patience of 8 epochs and a maximum budget of 50 epochs is employed. RGB-only models consistently run to the full 50-epoch budget, especially on more complex (Multi-Path, Mixed-Light) settings, suggesting difficulty extracting robust control features from RGB alone. Models incorporating event information (Event-only and Fusion) converge substantially faster, often triggering early stopping between epochs 20 and 35, indicating that high-temporal-resolution event cues provide a stronger learning signal. Fusion and Event-only models achieve the lowest validation MAE in most regimes — for example, in the (Multi-Path, Mixed-Light) setting, ENP-Fusion reaches an MAE of 0.0370 compared to ENP-RGB's 0.0707.
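The training setup reported above (AdamW with lr 2×10⁻⁴ and weight decay 3×10⁻⁴, early stopping with patience 8 inside a 50-epoch budget) can be sketched as the loop below. The model/loader interfaces are assumptions, and L1 loss is chosen here to match the reported MAE metric:

```python
import torch

def train(model, train_loader, val_loader, max_epochs=50, patience=8):
    # Reported hyperparameters: AdamW, lr 2e-4, weight decay 3e-4
    opt = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=3e-4)
    loss_fn = torch.nn.L1Loss()  # MAE on (v, ω), matching the reported metric
    best, stale = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for rgb, evt, action in train_loader:
            opt.zero_grad()
            loss_fn(model(rgb, evt), action).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(r, e), a).item()
                      for r, e, a in val_loader) / len(val_loader)
        if val < best - 1e-6:      # track best validation MAE
            best, stale = val, 0
        else:
            stale += 1
            if stale >= patience:  # early stopping after `patience` flat epochs
                break
    return best
```

Under this scheme, the faster convergence of event-aware models shows up as early stopping firing well before the 50-epoch budget.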

To evaluate low-light robustness, we consider policies trained only on normal-light data and test them on both normal-light and low-light conditions. On simple single-path trajectories, ENP-RGB generalizes well to low light, with Total MAE changing only slightly (0.0210 → 0.0213), while the Fusion and Event-only variants perform similarly, suggesting that for simple structured trajectories, RGB retains sufficient contrast even under reduced illumination. However, with multi-path training the benefits of fusion become pronounced: ENP-RGB degrades from 0.0463 to 0.0514 Total MAE, while ENP-Fusion maintains a lower error under both conditions (0.0335 → 0.0467). These findings support the low-light robustness of fusion policies and the zero-shot generalization capability of event data.


Recordings


BibTeX