To address the challenges that existing Video Large Language Models (VideoLLMs) face in multi-shot scenarios (video clips containing different camera angles or scene changes), we present MultiClip-Bench, a new dataset featuring dense descriptions and instruction-based question-answer pairs tailored to multi-shot videos. We further analyze why existing models encode object information incompletely, and propose a new model, IPFormer-VideoLLM, which injects object-level features as instance prompts through efficient attention-based concatenation. Experimental results demonstrate that the proposed dataset and model significantly improve multi-scene video understanding and offer distinct advantages across various video benchmarks.
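To make the "instance prompts via attention-based concatenation" idea concrete, below is a minimal sketch of one way such an injection could look: object-level features are projected and concatenated with learnable queries, which then attend over the video tokens. This is an illustrative assumption, not the authors' implementation; the class name `InstancePromptAggregator`, the dimensions, and all other parameters are hypothetical.

```python
# Minimal sketch (not the paper's code): injecting object-level features as
# "instance prompts" into an attention-based token aggregator. All names and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class InstancePromptAggregator(nn.Module):
    """Compress video tokens with a fixed set of queries, where the query set is
    extended (concatenated) with projected object-level instance features."""

    def __init__(self, dim: int = 1024, num_learnable_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.learnable_queries = nn.Parameter(torch.randn(num_learnable_queries, dim) * 0.02)
        self.instance_proj = nn.Linear(dim, dim)  # map object features into query space
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, instance_feats: torch.Tensor) -> torch.Tensor:
        # video_tokens:   (B, N_tokens, dim)  visual tokens from the vision encoder
        # instance_feats: (B, N_objects, dim) object-level features (e.g. from a detector)
        B = video_tokens.size(0)
        queries = self.learnable_queries.unsqueeze(0).expand(B, -1, -1)
        prompts = self.instance_proj(instance_feats)
        # Attention-based concatenation: instance prompts join the learnable queries,
        # so object identities steer which video tokens are aggregated.
        queries = torch.cat([queries, prompts], dim=1)
        kv = self.norm_kv(video_tokens)
        out, _ = self.attn(self.norm_q(queries), kv, kv)
        return out  # (B, num_learnable_queries + N_objects, dim), fed to the LLM


if __name__ == "__main__":
    agg = InstancePromptAggregator()
    video = torch.randn(2, 256, 1024)   # 2 clips, 256 visual tokens each
    objects = torch.randn(2, 8, 1024)   # 8 object-level features per clip
    print(agg(video, objects).shape)    # torch.Size([2, 40, 1024])
```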