This paper presents VideoExplorer, a framework for long-form video understanding (LVU), a long-standing challenge in computer vision. Grounded in the principle of "thinking with videos," VideoExplorer naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates the relevant moments, and performs task-oriented, temporally scalable video understanding until it reaches a final answer, enabling accurate, efficient, and interpretable reasoning. Furthermore, to address the scarcity of LVU training resources, we construct a long-form video reasoning dataset using difficulty-adaptive sampling, which ensures high-quality trajectories on challenging tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization. This pipeline encourages adaptive temporal grounding and iterative information integration guided by downstream rewards.
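
To make the iterative inference procedure concrete, the following minimal Python sketch illustrates one way such a plan-ground-perceive loop could be structured. All names here (`plan_subquestion`, `ground_segment`, `perceive`, `try_answer`) are illustrative assumptions for exposition, not the actual VideoExplorer API.

```python
# Illustrative sketch of an iterative "thinking with videos" loop.
# Every method on `model` below is a hypothetical stand-in, not the
# paper's actual implementation.

from dataclasses import dataclass, field

@dataclass
class ExplorationState:
    question: str                                  # the original user query
    evidence: list = field(default_factory=list)   # accumulated ((start, end), observation) pairs

def video_explorer(video, question, model, max_steps=8):
    """Iteratively plan sub-questions, ground them in time, and
    integrate observations until the model can answer."""
    state = ExplorationState(question=question)
    for _ in range(max_steps):
        # 1. Planning: pose the next sub-question given the evidence so far.
        subq = model.plan_subquestion(state.question, state.evidence)

        # 2. Temporal grounding: locate the moment relevant to the sub-question.
        start, end = model.ground_segment(video, subq)

        # 3. Task-oriented, temporally scalable perception: inspect the
        #    grounded span at a granularity adapted to the sub-question.
        observation = model.perceive(video[start:end], subq)
        state.evidence.append(((start, end), observation))

        # 4. Decide whether the accumulated evidence suffices to answer.
        answer, confident = model.try_answer(state.question, state.evidence)
        if confident:
            return answer, state.evidence   # the trace doubles as an interpretable rationale
    # Fall back to the best available answer once the step budget is exhausted.
    return model.try_answer(state.question, state.evidence)[0], state.evidence
```

Because the loop only ever decodes the segments it grounds, the returned evidence list also serves as an interpretable account of which moments supported the final answer.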
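
Likewise, a hedged sketch of the dataset construction and two-stage training pipeline, under the assumption that harder tasks receive more rollouts and that preferences contrast whole trajectories by downstream reward; the helpers (`estimate_difficulty`, `downstream_reward`, `sft_loss`, `preference_loss`) are hypothetical, not the paper's code.

```python
# Hypothetical sketch of difficulty-adaptive sampling plus the two-stage
# training pipeline; all helper functions are assumed for illustration.

def build_dataset(videos, tasks, policy, base_samples=4):
    """Difficulty-adaptive sampling: roll out more candidate trajectories
    on harder tasks so every task retains high-quality traces."""
    records = []
    for video, task in zip(videos, tasks):
        difficulty = estimate_difficulty(video, task)        # e.g. seed-policy failure rate in [0, 1]
        n = base_samples + round(base_samples * difficulty)  # sample harder tasks more often
        trajs = [policy.rollout(video, task) for _ in range(n)]
        records.append(sorted(trajs, key=downstream_reward, reverse=True))  # best-first per task
    return records

def train(policy, records):
    # Stage 1: supervised trajectory initialization on each task's best trace.
    for trajs in records:
        policy.update(sft_loss(policy, trajs[0]))
    # Stage 2: trajectory-level preference optimization, contrasting the
    # highest- and lowest-reward trajectories of the same task so that the
    # downstream reward shapes grounding and information integration.
    for trajs in records:
        if len(trajs) > 1:
            policy.update(preference_loss(policy, chosen=trajs[0], rejected=trajs[-1]))
```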