Video Large Language Models (Vid-LLMs) have demonstrated remarkable performance in video understanding tasks, yet their robustness under conversational interaction remains largely underexplored. In this paper, we identify spatiotemporal sycophancy, a failure mode in which Vid-LLMs retract initially correct, visually grounded judgments and conform to misleading user feedback under negation-based gaslighting.
To systematically investigate this phenomenon, we propose a negation-based gaslighting evaluation framework and introduce Gas Video-1000, a curated benchmark designed to probe spatiotemporal sycophancy with clear visual grounding and temporal reasoning requirements. Extensive experiments reveal that vulnerability to negation-based gaslighting is pervasive and severe, even among models with strong baseline performance.
Overview of the Spatiotemporal Sycophancy Evaluation Framework.
Temporal Sycophancy
Model alters chronological reasoning due to deceptive framing.
Spatial Sycophancy
Model misidentifies object locations following user gaslighting.
We evaluate Vid-LLMs under three distinct modalities of deceptive pressure to simulate real-world social pressures and evaluate the robust groundedness of Vid-LLMs against deceptive human feedback:
Class distribution of GasVideo-1000.
Detailed statistics of GasVideo-1000 sources.
@article{tang2025spatiotemporal,
author = {Tang, Ziyao and Jiao, Pengkun and Zhu, Bin and Qi, Huiyan and Chen, Jingjing and Jiang, Yu-Gang},
title = {Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models},
journal = {Findings of the Association for Computational Linguistics: ACL},
year = {2026},
}