In this paper, we propose Time-RA (Time-series Reasoning for Anomalies), a new generative, reasoning-oriented task that leverages large language models (LLMs) to overcome the limitations of existing binary-classification-based time-series anomaly detection. We introduce RATs40K, a multimodal benchmark dataset of approximately 40,000 samples spanning 10 real-world domains, in which each sample is annotated with numerical time-series data, contextual text, visual representations, fine-grained anomaly types (14 univariate and 6 multivariate), and structured explanatory reasoning. We develop a sophisticated annotation framework in which ensemble-generated labels are refined through GPT-4-based feedback to ensure accuracy and interpretability. Through extensive benchmarking of LLMs and multimodal LLMs, we characterize the capabilities and limitations of current models and highlight the importance of supervised fine-tuning. The dataset and task presented in this study will contribute to the advancement of interpretable time-series anomaly detection and reasoning.