This paper presents a comprehensive evaluation of the metaphor interpretation abilities of large language models (LLMs) across a variety of datasets, tasks, and prompt settings. Whereas previous studies have been limited to single-dataset evaluations and specific task settings, often relying on artificial data constructed through lexical substitution, this study conducts extensive experiments on natural language inference (NLI) and question answering (QA) tasks using a range of publicly available datasets with inference and metaphor annotations. The results show that LLM performance is influenced more by features such as lexical redundancy and sentence length than by metaphorical content, suggesting that the seemingly novel ability of LLMs to understand metaphorical language results from a combination of surface features, in-context learning, and linguistic knowledge. This study highlights the need for more realistic evaluation frameworks for metaphor interpretation tasks and provides important insights into the capabilities and limitations of LLMs when processing metaphorical language. The data and code are publicly available.