Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models

1Institute of Trustworthy Embodied AI, Fudan University, 2Shanghai Key Laboratory of Multimodal Embodied AI, 3Singapore Management University

Abstract

Video Large Language Models (Vid-LLMs) have demonstrated remarkable performance in video understanding tasks, yet their robustness under conversational interaction remains largely underexplored. In this paper, we identify spatiotemporal sycophancy, a failure mode in which Vid-LLMs retract initially correct, visually grounded judgments and conform to misleading user feedback under negation-based gaslighting.

To systematically investigate this phenomenon, we propose a negation-based gaslighting evaluation framework and introduce Gas Video-1000, a curated benchmark designed to probe spatiotemporal sycophancy with clear visual grounding and temporal reasoning requirements. Extensive experiments reveal that vulnerability to negation-based gaslighting is pervasive and severe, even among models with strong baseline performance.

Evaluation Framework Overview

Overview of the Spatiotemporal Sycophancy Evaluation Framework.

Examples of Video Gaslighting

Temporal Sycophancy

Temporal Sycophancy

Model alters chronological reasoning due to deceptive framing.

Spatial Sycophancy

Spatial Sycophancy

Model misidentifies object locations following user gaslighting.

Negation-Based Gaslighting

We evaluate Vid-LLMs under three distinct modalities of deceptive pressure to simulate real-world social pressures and evaluate the robust groundedness of Vid-LLMs against deceptive human feedback:

  • Direct Denial: This approach explicitly rejects the model's prediction by flatly asserting a false alternative premise, challenging the model to align with an objectively incorrect statement.
  • Authority Appeal: This strategy invokes a simulated authoritative persona (e.g., an expert or a supervisor) to dismiss the model's reasoning as incorrect or amateurish, leveraging perceived hierarchy to induce doubt.
  • Emotional Pressure: The prompts utilize charged linguistic cues conveying frustration or stern disappointment to undermine the model's confidence and pressure it into conforming to the user's erroneous narrative.

Dataset Statistics

Class distribution of GasVideo-1000

Class distribution of GasVideo-1000.

Detailed statistics of GasVideo-1000

Detailed statistics of GasVideo-1000 sources.

Key Findings

  • Rationalized Hallucinations: Rather than merely changing their answers, models often fabricate unsupported temporal or spatial explanations to justify incorrect revisions. This demonstrates that models actively leverage their generative capacity to construct a coherent, albeit false, reality.
  • Pervasiveness of Sycophancy: Even powerful proprietary models like Gemini-3-Pro suffer significant performance degradation under gaslighting, while open-source models like Qwen3-VL exhibit staggering drops.
  • Preemptive Prompt Hardening: While prompt-level grounding constraints can partially mitigate this behavior, they do not reliably prevent hallucinated justifications. Gemini-3-Pro exhibits exceptional sensitivity to hardening, with its Sycophancy Rate dropping sharply from 54.80% to 8.67%, whereas Qwen3-VL shows a more modest reduction.

BibTeX

@article{tang2025spatiotemporal,
  author    = {Tang, Ziyao and Jiao, Pengkun and Zhu, Bin and Qi, Huiyan and Chen, Jingjing and Jiang, Yu-Gang},
  title     = {Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models},
  journal   = {Findings of the Association for Computational Linguistics: ACL},
  year      = {2026},
}