Existing Video Large Language Models (Video-LLMs) struggle to understand fine-grained temporal information. In this paper, we propose Grounded-VideoLLM, a novel Video-LLM that perceives and reasons over specific video moments at a fine-grained level. To overcome the shortcomings of existing models in temporal modeling and timestamp representation, Grounded-VideoLLM introduces an additional temporal stream that encodes inter-frame relationships, together with discrete temporal tokens enriched with specific time information to represent timestamps. We train the model with a multi-stage learning scheme and further strengthen its temporal reasoning capability with a grounded VideoQA dataset constructed through an automatic annotation pipeline. Experimental results show that Grounded-VideoLLM excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, highlighting its potential as a versatile video assistant for general video understanding.
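To make the idea of discrete temporal tokens concrete, the following is a minimal sketch of how a continuous timestamp could be quantized into one of a fixed set of temporal tokens; the number of bins, the `<T_i>` token naming, and the rounding scheme are illustrative assumptions rather than the paper's exact implementation.

```python
# Illustrative sketch: quantizing continuous timestamps into discrete temporal tokens.
# The bin count (num_time_tokens) and the "<T_i>" naming are assumptions for
# illustration, not necessarily the configuration used in Grounded-VideoLLM.

def timestamp_to_token(t: float, video_duration: float, num_time_tokens: int = 100) -> str:
    """Map an absolute timestamp (seconds) to a discrete temporal token."""
    rel = min(max(t / video_duration, 0.0), 1.0)                  # normalize to [0, 1]
    idx = min(int(rel * num_time_tokens), num_time_tokens - 1)    # bin index
    return f"<T_{idx}>"

def token_to_timestamp(token: str, video_duration: float, num_time_tokens: int = 100) -> float:
    """Recover an approximate timestamp (bin center) from a temporal token."""
    idx = int(token.strip("<>").split("_")[1])
    return (idx + 0.5) / num_time_tokens * video_duration

# Example: a moment at 37.2 s in a 120 s video
tok = timestamp_to_token(37.2, 120.0)      # -> "<T_31>"
approx = token_to_timestamp(tok, 120.0)    # -> ~37.8 s
```

In a scheme of this kind, the temporal tokens would typically be added to the language model's vocabulary so that timestamps can be read and generated like ordinary tokens, which is what allows the model to ground its answers to specific moments.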