This study explores the potential of large language models (LLMs) for automatically identifying metaphors in discourse. We compare three methods: retrieval-augmented generation (RAG), prompt engineering, and fine-tuning. The results demonstrate that a state-of-the-art closed-source LLM can achieve high accuracy, with fine-tuning achieving a median F1 score of 0.79. Comparing human annotations with LLM outputs reveals that most discrepancies are systematic and reflect well-known gray areas and conceptual challenges in metaphor theory.
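As a rough illustration of the prompt-engineering setting only (not the study's actual prompt or evaluation pipeline), a minimal zero-shot sketch for word-level metaphor labeling with a closed-source chat model might look as follows; the model name, prompt wording, and label format are assumptions for illustration.

```python
# Minimal zero-shot prompt-engineering sketch for word-level metaphor
# identification. Prompt wording, model name, and label format are
# illustrative assumptions, not the study's actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_metaphor(sentence: str, target_word: str) -> str:
    """Ask the model whether `target_word` is used metaphorically in `sentence`."""
    prompt = (
        "Decide whether the target word is used metaphorically or literally "
        "in the sentence: if its contextual meaning contrasts with a more "
        "basic, concrete meaning, label it metaphorical.\n"
        f"Sentence: {sentence}\n"
        f"Target word: {target_word}\n"
        "Answer with exactly one label: METAPHORICAL or LITERAL."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; swap in whichever LLM is evaluated
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic labels for annotation-style tasks
    )
    return response.choices[0].message.content.strip()

# Example usage: "devoured" is plausibly metaphorical in this sentence.
print(label_metaphor("She devoured the novel in one sitting.", "devoured"))
```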