This paper introduces a full-stack framework that leverages reinforcement learning to scale the reasoning capabilities of vision-language models (VLMs) to long videos. We integrate three components: (1) LongVideo-Reason, a large-scale dataset of 104,000 long-video QA pairs with high-quality reasoning annotations across diverse domains, including sports, gaming, and vlogging; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) Multi-modal Reinforcement Sequence Parallelism (MR-SP), a training infrastructure for long-video RL that combines sequence parallelism with a vLLM-based engine tailored for long videos, using cached video embeddings for efficient rollout and prefilling. LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% accuracy without subtitles and 71.1% with subtitles on VideoMME, and consistently outperforms LongVILA-7B across multiple benchmarks. LongVILA-R1-7B also processes up to 8,192 video frames per video with configurable FPS settings. Our MR-SP system achieves up to a 2.1x speedup in long-video RL training. Finally, we release an open training system that supports RL training across modalities (video, text, and audio), across models (the VILA and Qwen series), and on image and video generation models, enabling RL training on hour-long videos (e.g., 3,600 frames) on a single A100 node (8 GPUs).
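To make the role of the cached video embeddings concrete, the sketch below illustrates the general idea rather than the paper's actual MR-SP implementation: frames are passed through the vision encoder once, and the resulting embeddings are cached so that subsequent RL rollouts and prefills over the same video reuse them instead of re-encoding. All names here (`VideoEmbeddingCache`, `encode_frames`) are hypothetical.

```python
# Minimal sketch (assumption): cache vision-encoder outputs per video so RL
# rollouts over the same video skip re-encoding. Names are hypothetical and
# do not reflect the actual MR-SP implementation.
from typing import Callable, Dict

import torch


class VideoEmbeddingCache:
    """Caches per-video frame embeddings keyed by a video identifier."""

    def __init__(self, encode_frames: Callable[[torch.Tensor], torch.Tensor]):
        self.encode_frames = encode_frames  # vision encoder: frames -> embeddings
        self._cache: Dict[str, torch.Tensor] = {}

    def get(self, video_id: str, frames: torch.Tensor) -> torch.Tensor:
        # Encode once; every later rollout/prefill reuses the cached result.
        if video_id not in self._cache:
            with torch.no_grad():
                self._cache[video_id] = self.encode_frames(frames)
        return self._cache[video_id]


# Toy usage: a stand-in encoder mapping (num_frames, C, H, W) -> (num_frames, dim).
encoder = lambda frames: frames.flatten(1).mean(dim=1, keepdim=True).expand(-1, 16)
cache = VideoEmbeddingCache(encoder)

frames = torch.randn(64, 3, 32, 32)          # small stand-in clip
emb_first = cache.get("video_0001", frames)  # encoded once here
emb_again = cache.get("video_0001", frames)  # cache hit: no re-encoding
assert emb_first is emb_again
```

Since the encoder pass over thousands of frames dominates prefill cost, amortizing it across the many rollouts RL generates per video is what makes this caching worthwhile.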
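As a second illustration, sampling frames at a configurable FPS reduces to choosing which source-frame indices to decode, capped at a maximum frame budget such as the 8,192 frames mentioned above. The helper below is a hypothetical sketch of that arithmetic, not a function from the released system.

```python
# Hypothetical sketch: pick source-frame indices for a target FPS, capped at
# a maximum frame budget (e.g., the 8,192-frame limit noted in the abstract).
def sample_frame_indices(total_frames: int, source_fps: float,
                         target_fps: float, max_frames: int = 8192) -> list[int]:
    step = max(source_fps / target_fps, 1.0)  # source frames per sampled frame
    indices = [int(i * step) for i in range(int(total_frames / step))]
    return indices[:max_frames]

# A 1-hour video at 30 FPS sampled at 1 FPS yields 3,600 frames,
# matching the hour-long (3,600-frame) setting described above.
assert len(sample_frame_indices(108_000, 30.0, 1.0)) == 3600
```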