To overcome the limitations of conventional time-series anomaly detection, which is typically framed as binary classification, this paper proposes Time-RA (Time-series Reasoning for Anomalies), a novel generative, reasoning-driven task for time-series anomalies that leverages large language models (LLMs). We present RATs40K, a multimodal benchmark dataset of approximately 40,000 real-world samples. Each sample includes numerical time-series data, contextual text, and visual representations, together with fine-grained anomaly-type labels (14 univariate and 6 multivariate types) and structured explanatory reasoning. A GPT-4-based annotation framework is used to ensure the accuracy and interpretability of the annotations. Extensive benchmarking of LLMs and multimodal LLMs characterizes the performance and limitations of current models and highlights the importance of supervised fine-tuning. The dataset and code are publicly available to support future research.