In this paper, we present a full-stack framework that extends the reasoning capabilities of Video Language Models (VLMs) to long videos using reinforcement learning. We first build LongVideo-Reason, a large-scale dataset of 52,000 long-video question-answer pairs, and a two-stage training pipeline that combines chain-of-thought supervised fine-tuning (CoT-SFT) with reinforcement learning (RL). We further develop a training infrastructure for long-video RL, Multi-modal Reinforcement Sequence Parallelism (MR-SP), which reuses cached video embeddings for efficient rollout and prefilling. Experimental results show that LongVILA-R1-7B achieves strong performance on long-video QA benchmarks such as VideoMME, outperforming Video-R1-7B and matching Gemini-1.5-Pro on temporal reasoning, goal and objective reasoning, spatial reasoning, and plot reasoning. Notably, the MR-SP system speeds up long-video RL training by up to 2.1×, and LongVILA-R1 shows consistent performance gains as the number of input video frames increases. Finally, we release a training system that supports RL training on various modalities (video, text, and audio), various models (the VILA and Qwen series), and even image and video generation models.