TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras

Mohammad Mohammadi, Ziyi Wu, Igor Gilitschenski
University of Toronto · Vector Institute
Accepted at ICCV 2025

Overview

In this work, we introduce TESPEC, a sequential pre-training paradigm with a novel target specifically designed for event cameras. Sequence models, such as recurrent architectures, are naturally better suited for processing event data since they can model entire streams, whereas feed-forward models are limited to short event intervals. However, training recurrent models requires large-scale labeled event streams, which are often unavailable.
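
To make this contrast concrete, below is a minimal PyTorch-style sketch, not the repository's API: the `RecurrentBackbone` class and its simple conv-RNN update are illustrative names and choices. The point is that a hidden state carried across event chunks lets features depend on the whole stream, whereas a feed-forward model sees only one chunk at a time.

import torch
import torch.nn as nn

class RecurrentBackbone(nn.Module):
    # Carries a hidden state across event chunks, so features can depend
    # on the whole stream; a feed-forward model would see one chunk only.
    def __init__(self, in_ch=10, hidden=64):
        super().__init__()
        self.encode = nn.Conv2d(in_ch, hidden, 3, padding=1)
        self.fuse = nn.Conv2d(2 * hidden, hidden, 3, padding=1)  # simple conv-RNN update

    def forward(self, chunk, state=None):
        feat = torch.relu(self.encode(chunk))
        if state is None:
            state = torch.zeros_like(feat)
        return torch.tanh(self.fuse(torch.cat([feat, state], dim=1)))

# A stream of T event chunks, each rasterized to a (B, 10, H, W) voxel grid.
stream = [torch.randn(1, 10, 64, 64) for _ in range(20)]
model, state = RecurrentBackbone(), None
for chunk in stream:
    state = model(chunk, state)  # state now summarizes the entire stream so far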

To address this challenge, we propose a self-supervised pre-training framework that enables the model to learn both temporal and spatial representations. In TESPEC, the recurrent backbone processes masked event streams and reconstructs a specially designed target that resembles grayscale images while encoding temporal information.
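
A rough sketch of how such a pre-training step could look, reusing the `RecurrentBackbone` sketch above. Everything here is illustrative rather than TESPEC's actual recipe: as a stand-in for the paper's specially designed target, a leaky integration of event polarities yields an image-like map that still reflects event recency, and a simple random patch mask stands in for the masking scheme.

import torch
import torch.nn.functional as F

def leaky_integration(prev_target, event_chunk, decay=0.9):
    # Decay old evidence, accumulate new polarity counts: the result looks
    # image-like yet still encodes how recently events fired (an
    # illustrative stand-in for the paper's target).
    return decay * prev_target + event_chunk.sum(dim=1, keepdim=True)

def mask_patches(x, patch=16, ratio=0.5):
    # Zero out a random subset of patch-aligned regions of the input.
    B, _, H, W = x.shape
    keep = (torch.rand(B, 1, H // patch, W // patch) > ratio).float()
    return x * F.interpolate(keep, size=(H, W), mode="nearest")

def pretrain_step(backbone, head, chunks, optimizer):
    B, _, H, W = chunks[0].shape
    state, target, loss = None, torch.zeros(B, 1, H, W), 0.0
    for chunk in chunks:
        target = leaky_integration(target, chunk)     # evolving reconstruction target
        state = backbone(mask_patches(chunk), state)  # backbone sees masked events
        loss = loss + F.l1_loss(head(state), target)  # reconstruct the unmasked target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with the sketch above (head maps hidden features to one channel):
# backbone, head = RecurrentBackbone(), torch.nn.Conv2d(64, 1, 1)
# opt = torch.optim.Adam([*backbone.parameters(), *head.parameters()], lr=1e-4)
# pretrain_step(backbone, head, [torch.randn(2, 10, 64, 64) for _ in range(8)], opt)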

We evaluate TESPEC by fine-tuning the pre-trained backbone on three core perception tasks across five real-world datasets: object detection, semantic segmentation, and depth estimation. Our approach achieves state-of-the-art performance on all benchmarks. The full codebase and checkpoints are publicly available; please refer to our GitHub repository and the paper for further details.

TESPEC pipeline overview animation

Results

Object Detection Results

Semantic Segmentation Results

Depth Estimation Results

Inference Runtime

Citation

If you find our work useful in your research or projects, please consider citing:

@article{mohammadi2025tespec,
  title   = {TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras},
  author  = {Mohammadi, Mohammad and Wu, Ziyi and Gilitschenski, Igor},
  journal = {arXiv preprint arXiv:2508.00913},
  year    = {2025}
}

Feel free to reach out for collaborations or discussions.