Approaches that combine the "Reason-Then-Respond" paradigm with reinforcement learning have advanced the development of multimodal large language models. When applied to the video domain, however, they have yielded models specialized for either question answering (QA) or captioning, each struggling to perform the other task. Because the two tasks conflict, naively combining their reward signals results in poor performance. To address this issue, this paper proposes a novel learning framework built on two intermediate proxy tasks: DarkEventInfer and MixVidQA. DarkEventInfer presents a video with masked event segments, requiring the model to infer the masked content from contextual video cues. MixVidQA presents an interleaved sequence composed of two distinct video clips, requiring the model to isolate and reason about one while ignoring the other. Together, these tasks foster both holistic, divergent understanding and precise, convergent reasoning capabilities. VidBridge-R1, which implements this framework, is the first multi-objective video reasoning model that effectively resolves this paradigm conflict. Extensive experiments show that VidBridge-R1 achieves significant performance gains on both QA and captioning within a single model, demonstrating the effectiveness of the proposed approach in fostering more generalizable and robust video understanding.